CN109034398A - Feature selection method, device and storage medium based on federated training - Google Patents
- Publication number: CN109034398A (application CN201810918867.3A)
- Authority: CN (China)
- Prior art keywords: training, split, sample, feature, node
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a feature selection method based on federated training, comprising the following steps: performing federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, where the model comprises multiple regression trees and each split node of a regression tree corresponds to one feature of the training samples; computing the average gain of the split nodes corresponding to the same feature in the gradient boosting tree model, and taking that average gain as the score of the corresponding feature; ranking the features by score and outputting the ranking for feature selection, where a feature in the training samples that does not correspond to any split node receives a default score. The invention also discloses a feature selection device and a computer-readable storage medium based on federated training. The invention enables federated training and modeling on training samples held by different data parties, and thereby enables feature selection over multi-party sample data.
Description
Technical field
The present invention relates to the field of machine learning, and more particularly to a feature selection method, device and computer-readable storage medium based on federated training.
Background art
In the current information age, many human behaviors, such as consumption, leave a data trail. Big data analysis has grown out of this: behavior models are built through machine learning and then used to classify people's behavior or to make predictions from users' behavioral features.
Existing machine learning techniques usually train on sample data held by a single party, i.e. single-party modeling. Based on the resulting model, the relatively important features in the sample feature set can be determined. In many cross-domain big-data scenarios, however, the data is split across parties. For example, a user both consumes and borrows: the consumption data is generated at a consumer service provider, while the borrowing data is generated at a financial service provider. If the financial service provider needs to predict the user's borrowing behavior from the user's consumption features, it must combine the consumer service provider's consumption data with its own borrowing data through machine learning to build the prediction model.

For such scenarios, therefore, a new modeling approach is needed that jointly trains on sample data from different data providers, so that both parties participate in the modeling.
Summary of the invention
The main purpose of the present invention is to provide a feature selection method, device and computer-readable storage medium based on federated training, aiming to solve the technical problem that the prior art cannot jointly train on sample data from different data providers, and therefore cannot let both parties participate in the modeling.
To achieve the above object, the present invention provides a feature selection method based on federated training, comprising the following steps:

performing federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, where the model comprises multiple regression trees and each split node of a regression tree corresponds to one feature of the training samples;

computing the average gain of the split nodes corresponding to the same feature in the gradient boosting tree model, and taking that average gain as the score of the corresponding feature;

ranking the features by score and outputting the ranking for feature selection, where a feature in the training samples that does not correspond to any split node receives a default score.
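The scoring-and-ranking steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the feature names, the gain records, and `DEFAULT_SCORE` are illustrative assumptions.

```python
from collections import defaultdict

DEFAULT_SCORE = 0.0  # assumed score for features that never correspond to a split node


def score_and_rank(split_records, all_features):
    """split_records: (feature, gain) pairs collected over all regression trees."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, gain in split_records:
        sums[feature] += gain
        counts[feature] += 1
    scores = {f: (sums[f] / counts[f] if counts[f] else DEFAULT_SCORE)
              for f in all_features}
    # Higher average gain -> more important feature.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


records = [("Age", 0.8), ("Age", 0.6), ("BillPayment", 0.5), ("Education", 0.2)]
ranking = score_and_rank(records, ["Age", "BillPayment", "Education", "Gender"])
# "Gender" never splits a node, so it falls back to DEFAULT_SCORE.
```

Here "Age" is scored by the mean of its two split-node gains, and "Gender" receives the default score because no split node corresponds to it.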
Optionally, the two aligned training samples are a first training sample and a second training sample;

the first training sample's attributes include a sample ID and part of the sample features, and the second training sample's attributes include the sample ID, the remaining sample features, and a data label;

the first training sample is provided by a first data party and stored locally at the first data party, and the second training sample is provided by a second data party and stored locally at the second data party.
Optionally, performing federated training on the two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model comprises:

at the second data party, obtaining the first-order and second-order gradients of each training sample in the sample set of the current round of node splitting;

if the current split is the first node split of a regression tree, encrypting the first-order and second-order gradients together with the sample IDs of the sample set and sending them to the first data party, so that the first data party uses the encrypted gradients to compute, for its local training samples matching those sample IDs, the gain of the split node under every split mode;

if the current split is not the first node split of the regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party reuses the first-order and second-order gradients of the first split to compute, for its local training samples matching those sample IDs, the gain of the split node under every split mode;

the second data party receiving and decrypting the encrypted gains of all split nodes returned by the first data party;

the second data party computing, from the first-order and second-order gradients, the gain of the split node under every split mode for its own local training samples matching those sample IDs;

determining the globally best split node of the current round from the gains of all split nodes computed by both parties;

splitting the sample set of the current node at the globally best split node and generating new nodes, thereby building a regression tree of the gradient boosting tree model.
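The per-split gain that each party computes locally can be sketched with the standard XGBoost gain formula. This is an illustration only: the `LAMBDA`/`GAMMA` regularizers and the toy gradient values are assumptions, not values given by the patent.

```python
LAMBDA, GAMMA = 1.0, 0.0  # assumed regularization parameters


def split_gain(g_left, h_left, g_right, h_right):
    """Standard XGBoost gain; g_*/h_* are sums of first- and second-order gradients."""
    def term(g, h):
        return g * g / (h + LAMBDA)
    return 0.5 * (term(g_left, h_left) + term(g_right, h_right)
                  - term(g_left + g_right, h_left + h_right)) - GAMMA


def best_split(feature_values, g, h):
    """Try every threshold of one feature; return (threshold, gain)."""
    order = sorted(range(len(feature_values)), key=lambda i: feature_values[i])
    best = (None, float("-inf"))
    gl = hl = 0.0
    gt, ht = sum(g), sum(h)
    for idx in order[:-1]:  # samples with value <= threshold go left
        gl += g[idx]
        hl += h[idx]
        gain = split_gain(gl, hl, gt - gl, ht - hl)
        if gain > best[1]:
            best = (feature_values[idx], gain)
    return best


# Toy data: gradients clearly separate low from high feature values.
thr, gain = best_split([20, 30, 35, 48, 10],
                       [0.9, -0.8, -0.9, -0.7, 0.8],
                       [0.2, 0.2, 0.2, 0.2, 0.2])
```

Each party runs this over its own features; the globally best split node is simply the split with the largest gain across both parties' candidates.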
Optionally, before the step of obtaining, at the second data party, the first-order and second-order gradients of each training sample in the current round's sample set, the method further comprises:

when splitting a node, judging whether the current split belongs to the construction of the first regression tree;

if it does, judging whether it is the first node split of the first regression tree;

if the current split is the first node split of the first regression tree, initializing, at the second data party, the first-order and second-order gradients of each training sample in the current round's sample set; if it is a later split of the first regression tree, reusing the first-order and second-order gradients of the first split;

if the current split belongs to the construction of a later regression tree, judging whether it is the first node split of that tree;

if it is the first node split of a later regression tree, updating the first-order and second-order gradients according to the previous round of federated training; if it is a later split of a later regression tree, reusing the first-order and second-order gradients of the first split.
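The branching above (first tree vs. later tree, first split vs. later split) amounts to a small piece of control flow. The sketch below is a hypothetical rendering; the `init`/`update` helpers and the cached-gradient shape are placeholders, not the patent's code.

```python
def gradients_for_round(first_tree, first_split, cached, init, update):
    """Decide where this round's (g, h) gradients come from.

    first_tree:  are we building the first regression tree?
    first_split: is this the first node split of the current tree?
    cached:      the (g, h) used by this tree's first split, or None.
    init/update: callables producing fresh (g, h).
    """
    if first_split:
        # First split of a tree: initialise for tree 1, otherwise update
        # from the previous round of federated training.
        return init() if first_tree else update()
    # Any later split of the same tree reuses the first split's gradients.
    return cached


g_h = gradients_for_round(first_tree=True, first_split=True,
                          cached=None,
                          init=lambda: ([0.5], [0.25]),
                          update=lambda: ([0.1], [0.09]))
```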
Optionally, the feature selection method based on federated training further comprises:

when new nodes are generated to build a regression tree of the gradient boosting tree model, judging, at the second data party, whether the depth of the current regression tree has reached a preset depth threshold;

if it has, stopping node splitting, which yields one regression tree of the gradient boosting tree model; otherwise, continuing with the next round of node splitting.
Optionally, the feature selection method based on federated training further comprises:

when node splitting stops, judging, at the second data party, whether the total number of regression trees has reached a preset count threshold;

if it has, stopping the federated training; otherwise, continuing with the next round of federated training.
Optionally, the feature selection method based on federated training further comprises:

recording, at the second data party, the relevant information of the globally best split node determined in each round of node splitting;

where the relevant information includes: the provider of the corresponding sample data, the feature coding of the corresponding sample data, and the gain.
Optionally, computing the average gain of the split nodes corresponding to the same feature in the gradient boosting tree model comprises:

at the second data party, taking each globally best split node as a split node of a regression tree in the gradient boosting tree model, and computing the average gain of the split nodes corresponding to the same feature coding.
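The statistic above can be sketched as a group-by-average over the recorded split-node information, i.e. the (provider, feature coding, gain) triples. The record values below are illustrative assumptions.

```python
from collections import defaultdict


def average_gain_by_feature(records):
    """records: (provider, feature_coding, gain) per globally best split node."""
    grouped = defaultdict(list)
    for provider, feature_code, gain in records:
        grouped[feature_code].append(gain)
    return {code: sum(gains) / len(gains) for code, gains in grouped.items()}


records = [
    ("party_A", "f1", 0.9),   # e.g. Age, chosen as best split by two trees
    ("party_A", "f1", 0.7),
    ("party_B", "f4", 0.4),   # e.g. Bill Payment
]
scores = average_gain_by_feature(records)
```

Grouping by feature coding (rather than feature name) matches the recorded relevant information, since the coding identifies the feature without revealing which party's raw column it is.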
Further, to achieve the above object, the present invention also provides a feature selection device based on federated training, comprising a memory, a processor, and a feature selection program stored on the memory and runnable on the processor, where the feature selection program, when executed by the processor, implements the steps of the feature selection method based on federated training described in any of the above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a feature selection program which, when executed by a processor, implements the steps of the feature selection method based on federated training described in any of the above.
The present invention performs federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, where the model is a set of multiple regression trees and each split node of a regression tree corresponds to one feature of the training samples. By computing the average gain of the split nodes corresponding to the same feature across the model and taking that average gain as the feature's score, the features of both parties' training sample data are scored; the features are then ranked by score and the ranking is output for feature selection, where a higher score means a more important feature. The invention thereby enables federated training and modeling on training samples held by different data parties, and in turn feature selection over multi-party sample data.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of the hardware operating environment involved in embodiments of the feature selection device based on federated training of the present invention;
Fig. 2 is a schematic flowchart of an embodiment of the feature selection method based on federated training of the present invention;
Fig. 3 is a detailed flowchart of an embodiment of step S10 in Fig. 2;
Fig. 4 is a schematic diagram of a training result of an embodiment of the feature selection method based on federated training of the present invention.
The realization of the object, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
The present invention provides a feature selection device based on federated training.
As shown in Fig. 1, Fig. 1 is a schematic structural diagram of the hardware operating environment involved in embodiments of the feature selection device based on federated training.

The feature selection device based on federated training of the present invention may be a PC, or a server or other equipment with computing capability.
As shown in Fig. 1, the feature selection device based on federated training may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 realizes the connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory such as a disk memory; optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.

Those skilled in the art will understand that the device structure shown in Fig. 1 does not limit the feature selection device based on federated training, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.

As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a feature selection program.
In the feature selection device based on federated training shown in Fig. 1, the network interface 1004 is mainly used to connect to a background server and exchange data with it; the user interface 1003 is mainly used to connect to a client (user terminal) and exchange data with it; and the processor 1001 may be used to call the feature selection program stored in the memory 1005 and perform the following operations:
performing federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, where the model comprises multiple regression trees and each split node of a regression tree corresponds to one feature of the training samples;

computing the average gain of the split nodes corresponding to the same feature in the gradient boosting tree model, and taking that average gain as the score of the corresponding feature;

ranking the features by score and outputting the ranking for feature selection, where a feature in the training samples that does not correspond to any split node receives a default score.
Further, the two aligned training samples are a first training sample and a second training sample; the first training sample's attributes include a sample ID and part of the sample features, and the second training sample's attributes include the sample ID, the remaining sample features, and a data label; the first training sample is provided by a first data party and stored locally at the first data party, and the second training sample is provided by a second data party and stored locally at the second data party. The processor 1001 calls the feature selection program stored in the memory 1005 and also performs the following operations:

at the second data party, obtaining the first-order and second-order gradients of each training sample in the sample set of the current round of node splitting;

if the current split is the first node split of a regression tree, encrypting the first-order and second-order gradients together with the sample IDs of the sample set and sending them to the first data party, so that the first data party uses the encrypted gradients to compute, for its local training samples matching those sample IDs, the gain of the split node under every split mode;

if the current split is not the first node split of the regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party reuses the first-order and second-order gradients of the first split to compute, for its local training samples matching those sample IDs, the gain of the split node under every split mode;

the second data party receiving and decrypting the encrypted gains of all split nodes returned by the first data party;

the second data party computing, from the first-order and second-order gradients, the gain of the split node under every split mode for its own local training samples matching those sample IDs;

determining the globally best split node of the current round from the gains of all split nodes computed by both parties;

splitting the sample set of the current node at the globally best split node and generating new nodes, thereby building a regression tree of the gradient boosting tree model.
Further, the processor 1001 calls the feature selection program stored in the memory 1005 and also performs the following operations:

when splitting a node, judging whether the current split belongs to the construction of the first regression tree;

if it does, judging whether it is the first node split of the first regression tree;

if the current split is the first node split of the first regression tree, initializing, at the second data party, the first-order and second-order gradients of each training sample in the current round's sample set; if it is a later split of the first regression tree, reusing the first-order and second-order gradients of the first split;

if the current split belongs to the construction of a later regression tree, judging whether it is the first node split of that tree;

if it is the first node split of a later regression tree, updating the first-order and second-order gradients according to the previous round of federated training; if it is a later split of a later regression tree, reusing the first-order and second-order gradients of the first split.
Further, the processor 1001 calls the feature selection program stored in the memory 1005 and also performs the following operations:

at the first data party, computing, from the encrypted first-order and second-order gradients, the gain of the split node under every split mode for the local training samples matching the sample IDs;

or, at the first data party, reusing the first-order and second-order gradients of the first split to compute the gain of the split node under every split mode for the local training samples matching the sample IDs;

encrypting the gains of all split nodes and sending them to the second data party.
Further, the processor 1001 calls the feature selection program stored in the memory 1005 and also performs the following operations:

when new nodes are generated to build a regression tree of the gradient boosting tree model, judging, at the second data party, whether the depth of the current regression tree has reached a preset depth threshold;

if it has, stopping node splitting, which yields one regression tree of the gradient boosting tree model; otherwise, continuing with the next round of node splitting.
Further, the processor 1001 calls the feature selection program stored in the memory 1005 and also performs the following operations:

when node splitting stops, judging, at the second data party, whether the total number of regression trees has reached a preset count threshold;

if it has, stopping the federated training; otherwise, continuing with the next round of federated training.
Further, the processor 1001 calls the feature selection program stored in the memory 1005 and also performs the following operations:

recording, at the second data party, the relevant information of the globally best split node determined in each round of node splitting;

where the relevant information includes: the provider of the corresponding sample data, the feature coding of the corresponding sample data, and the gain.
Further, the processor 1001 calls the feature selection program stored in the memory 1005 and also performs the following operations:

at the second data party, taking each globally best split node as a split node of a regression tree in the gradient boosting tree model, and computing the average gain of the split nodes corresponding to the same feature coding.
Based on the hardware operating environment involved in the above embodiments of the feature selection device based on federated training, the following embodiments of the feature selection method based on federated training of the present invention are proposed.

Referring to Fig. 2, Fig. 2 is a schematic flowchart of an embodiment of the feature selection method based on federated training of the present invention. In this embodiment, the feature selection method based on federated training comprises the following steps:
Step S10: performing federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, where the model comprises multiple regression trees and each split node of a regression tree corresponds to one feature of the training samples;
The XGBoost (eXtreme Gradient Boosting) algorithm is an improvement of the boosting algorithm on the basis of the GBDT (Gradient Boosting Decision Tree) algorithm; its internal decision trees are regression trees, and its output is a set of regression trees, i.e. it comprises multiple regression trees. The basic idea of the training process is to traverse all split methods (i.e. node-splitting modes) of all features of the training samples, select the split method with the smallest loss, obtain two leaves (i.e. split the node and generate new nodes), and then continue traversing until:

(1) if the stop-splitting condition is met, one regression tree is output;

(2) if the stop-iteration condition is met, the set of regression trees is output.
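The two stopping conditions imply a nested training loop, which can be sketched as follows. `MAX_DEPTH` and `NUM_TREES` are assumed hyperparameters (the patent only names them as preset thresholds), and `split_node` stands in for one round of (federated) node splitting.

```python
MAX_DEPTH, NUM_TREES = 3, 2  # assumed preset depth and tree-count thresholds


def train(split_node):
    """split_node(depth) performs one round of node splitting."""
    forest = []
    while len(forest) < NUM_TREES:          # stop-iteration condition
        depth, tree = 0, []
        while depth < MAX_DEPTH:            # stop-splitting condition
            tree.append(split_node(depth))
            depth += 1
        forest.append(tree)                 # one regression tree finished
    return forest                           # the regression tree set


forest = train(lambda d: f"split@depth{d}")
```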
In this embodiment, the XGBoost algorithm uses two independent training samples, i.e. each training sample belongs to a different data party. If the two training samples are regarded as one whole training sample, then, since they belong to different data parties, the whole training sample can be regarded as cut apart, with each party holding different features of the same samples (the samples are partitioned longitudinally).

Furthermore, since the two training samples belong to different data parties, federated training and modeling requires the raw sample data provided by the two parties to be aligned first.
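The sample alignment just mentioned amounts to keeping only the sample IDs both parties hold. In practice this would be done privately (e.g. with a private set intersection protocol, which the patent does not specify); a plain set intersection is shown for clarity, with made-up ID sets.

```python
# Each party's raw sample IDs (illustrative).
party_a_ids = {"X1", "X2", "X3", "X4", "X5", "X9"}
party_b_ids = {"X1", "X2", "X3", "X4", "X5", "X7"}

# Alignment: keep only samples present on both sides.
aligned = sorted(party_a_ids & party_b_ids)
# Each party then trains only on its own columns of these shared rows.
```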
In this embodiment, federated training means that the sample training process is completed jointly by the two cooperating data parties; in the regression trees of the finally trained gradient boosting tree model, the split nodes correspond to features of both parties' training samples.
Step S20: computing the average gain of the split nodes corresponding to the same feature in the gradient boosting tree model, and taking that average gain as the score of the corresponding feature;
In the XGBoost algorithm, when traversing all split methods of all features of the training samples, the quality of a split method is evaluated by its gain, and each split node selects the split method with the smallest loss. The gain of a split node can therefore serve as the basis for evaluating feature importance: the larger the gain of a split node, the smaller the splitting loss, and thus the more important the feature corresponding to that split node.
In this embodiment, since the trained gradient boosting tree model comprises multiple regression trees, and different regression trees may split nodes on the same feature, it is necessary to compute, over all regression trees of the model, the average gain of the split nodes corresponding to the same feature, and to take that average gain as the score of the corresponding feature.
Step S30: ranking the features by score and outputting the ranking for feature selection, where a feature in the training samples that does not correspond to any split node receives a default score.
In this embodiment, a feature's score represents its importance. After the score of each feature is obtained, the features are ranked and the ranking is output, for example from high to low, so that features ranked earlier are more important than features ranked later. Feature selection can therefore remove features that are irrelevant to the prediction or classification of the samples. For example, suppose a student sample includes gender, school grades, attendance rate, and praise count, and the classification target is excellent student versus non-excellent student; the gender feature is clearly unrelated, or only weakly related, to being an excellent student and can therefore be removed.
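The selection step of the student example can be sketched as a cutoff on the scores. The scores and the cutoff below are made-up illustrations: a near-zero score for gender reflects that it rarely, if ever, wins a split.

```python
# Hypothetical average-gain scores for the student example.
scores = {"grade": 0.9, "attendance": 0.6, "praise_count": 0.3, "gender": 0.01}
CUTOFF = 0.05  # assumed selection threshold

# Keep features at or above the cutoff, most important first.
selected = [f for f, s in sorted(scores.items(), key=lambda kv: -kv[1])
            if s >= CUTOFF]
```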
This embodiment performs federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, where the model is a set of multiple regression trees and each split node of a regression tree corresponds to one feature of the training samples. By computing the average gain of the split nodes corresponding to the same feature across the model and taking that average gain as the feature's score, the features of the two parties' training sample data are scored; the features are then ranked by score and the ranking is output for feature selection, where a higher score means a more important feature. This embodiment thereby enables federated training and modeling on training samples held by different data parties, and in turn feature selection over multi-party sample data.
Further, to ease the description of the specific implementation of the joint training of the present invention, this embodiment is illustrated with two independent training samples.

In this embodiment, the first data party provides the first training sample, whose attributes include a sample ID and part of the sample features; the second data party provides the second training sample, whose attributes include the sample ID, the remaining sample features, and a data label.
A sample feature is a feature that a sample exhibits or possesses; for example, if the samples are people, the sample features may be age, gender, income, education, and so on. The data label classifies the different samples, with the classification result determined from the samples' features.
A major significance of the federated training and modeling of the present invention is the two-way privacy protection of both parties' sample data. Therefore, during federated training, the first training sample is stored locally at the first data party and the second training sample is stored locally at the second data party. For example, the data in Table 1 below is provided by the first data party and stored locally at the first data party, and the data in Table 2 below is provided by the second data party and stored locally at the second data party.
Table 1
Sample ID | Age | Gender | Amount of given credit |
X1 | 20 | 1 | 5000 |
X2 | 30 | 1 | 300000 |
X3 | 35 | 0 | 250000 |
X4 | 48 | 0 | 300000 |
X5 | 10 | 1 | 200 |
As shown in Table 1, the first training sample's attributes include the sample ID (X1~X5) and the Age, Gender, and Amount of given credit features.
Table 2
Sample ID | Bill Payment | Education | Label |
X1 | 3102 | 2 | 24 |
X2 | 17250 | 3 | 14 |
X3 | 14027 | 2 | 16 |
X4 | 6787 | 1 | 10 |
X5 | 280 | 1 | 26 |
As shown in Table 2, the attributes of the second training sample include the sample IDs (X1~X5), the Bill Payment feature, the Education feature, and the data label.
Further, referring to Fig. 3, Fig. 3 is a detailed flow diagram of an embodiment of step S10 in Fig. 2. Based on the above embodiment, in the present embodiment, step S10 specifically includes:
Step S101: at the second data party, obtain the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
XGBoost is a machine learning modeling method. It needs a classifier (that is, a classification function) to map sample data to one of several given classes, so that the result can be applied to data prediction. In the process of learning classification rules with the classifier, a loss function is needed to measure the fitting error of the machine learning model.
In the present embodiment, every time node splitting is performed, the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting are obtained at the second data party.
Here, the gradient boosting tree model requires multiple rounds of federated training: each round of federated training generates one regression tree, and generating one regression tree requires multiple rounds of node splitting.
Therefore, within each round of federated training, the first node split uses the initially saved training samples, and each subsequent node split uses the training samples of the sample sets corresponding to the new nodes produced by the previous split. Within the same round of federated training, every node split reuses the first-order and second-order gradients used in the first node split of that round; the next round of federated training updates the first-order and second-order gradients used in the previous round according to the previous round's federated training result.
XGBoost supports a user-defined loss function. The first-order and second-order partial derivatives of the objective function are taken using the user-defined loss function, yielding the first-order gradient and the second-order gradient of the local sample data to be trained.
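As an illustrative sketch of this step (assuming squared-error loss as the user-defined loss, with logistic loss shown for comparison; neither is prescribed by the invention), the per-sample gradients reduce to simple derivatives:

```python
import math

def gradients_squared_loss(y_true, y_pred):
    """First- and second-order gradients of L = (1/2) * (pred - y)^2
    with respect to pred. A sketch; the patent allows any custom loss."""
    g = [p - y for p, y in zip(y_pred, y_true)]   # dL/dpred
    h = [1.0 for _ in y_true]                     # d2L/dpred2
    return g, h

def gradients_logistic_loss(y_true, y_pred):
    """Same derivatives for logistic loss with s = sigmoid(pred):
    g = s - y, h = s * (1 - s)."""
    s = [1.0 / (1.0 + math.exp(-p)) for p in y_pred]
    g = [si - y for si, y in zip(s, y_true)]
    h = [si * (1.0 - si) for si in s]
    return g, h
```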
Therefore, based on the explanation of the XGBoost algorithm and the gradient boosting tree model in the above embodiment, constructing a regression tree requires determining split nodes, and a split node can be determined by its gain value. The gain value gain is calculated as:

$$gain = \frac{1}{2}\left[\frac{\left(\sum_{i\in I_L} g_i\right)^2}{\sum_{i\in I_L} h_i + \lambda} + \frac{\left(\sum_{i\in I_R} g_i\right)^2}{\sum_{i\in I_R} h_i + \lambda} - \frac{\left(\sum_{i\in I} g_i\right)^2}{\sum_{i\in I} h_i + \lambda}\right] - \gamma$$

where $I_L$ denotes the sample set contained in the left child node after the current node is split, $I_R$ denotes the sample set contained in the right child node, $I = I_L \cup I_R$, $g_i$ denotes the first-order gradient of sample $i$, $h_i$ denotes the second-order gradient of sample $i$, and $\lambda$ and $\gamma$ are constants.
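A minimal sketch of this gain calculation (function name and argument layout are assumptions for illustration; the formula itself is the standard XGBoost split gain):

```python
def split_gain(g, h, left_idx, right_idx, lam=1.0, gamma=0.0):
    """Gain of splitting node I into I_L and I_R.
    g, h: dicts mapping sample id -> first-/second-order gradient.
    lam (lambda) and gamma are the regularization constants of the formula."""
    def score(idx):
        G = sum(g[i] for i in idx)
        H = sum(h[i] for i in idx)
        return G * G / (H + lam)
    all_idx = list(left_idx) + list(right_idx)
    return 0.5 * (score(left_idx) + score(right_idx) - score(all_idx)) - gamma

# A split that perfectly separates opposite gradients yields a positive gain.
example = split_gain({"a": 1.0, "b": -1.0}, {"a": 1.0, "b": 1.0}, ["a"], ["b"])
```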
Since the sample data to be trained resides at both the first data party and the second data party, the gain values of the split nodes under each splitting mode must be calculated separately at the first data party and at the second data party for their respective sample data.
In the present embodiment, because the first data party and the second data party have performed sample alignment in advance, both parties share the same gradients; and because the data label resides in the second data party's sample data, the gain values of both parties' split nodes under each splitting mode are calculated based on the first-order and second-order gradients of the second data party's sample data.
Step S102: if the current round of node splitting is the first node split in constructing a regression tree, encrypt the first-order gradient and the second-order gradient and send them, together with the sample IDs of the sample set, to the first data party, so that the first data party calculates, based on the encrypted first-order gradient and second-order gradient, the gain values of its local training samples corresponding to the sample IDs under each splitting mode;
In the present embodiment, to realize two-way privacy protection of both parties' sample data during federated training, if the current round of node splitting is the first node split in constructing a regression tree, the first-order and second-order gradients of the sample data calculated at the second data party are first encrypted and then sent to the first data party.
At the first data party, the gain values of the first data party's local sample data split nodes under each splitting mode are calculated from the received first-order and second-order gradients using the above calculation formula of the gain value. Since the first-order and second-order gradients are encrypted, the calculated gain values are also ciphertexts, so the gain values themselves need no further encryption.
After the gain values of the split nodes under the various splitting modes of the sample data have been calculated, the current node can be split to generate new nodes and construct the regression tree. In the present embodiment, the regression trees of the gradient boosting tree model are preferably constructed under the lead of the second data party, which holds the data label. Therefore, the gain values of the first data party's local sample data split nodes under each splitting mode, calculated at the first data party, need to be sent to the second data party.
Step S103: if the current round of node splitting is not the first node split in constructing the regression tree, send the sample IDs of the sample set to the first data party, so that the first data party, reusing the first-order gradient and second-order gradient used in the first node split, calculates the gain values of its local training samples corresponding to the sample IDs under each splitting mode;
In the present embodiment, if the current round of node splitting is not the first node split in constructing the regression tree, only the sample IDs of the sample set corresponding to the current split need to be sent to the first data party, and the first data party, continuing to reuse the first-order and second-order gradients used in the first node split, calculates the gain values of its local training samples corresponding to the received sample IDs under each splitting mode.
Step S104: the second data party receives the encrypted gain values of all split nodes returned by the first data party and decrypts them;
Step S105: at the second data party, based on the first-order gradient and the second-order gradient, calculate the gain values of the local training samples corresponding to the sample IDs under each splitting mode;
At the second data party, the gain values of the local sample data to be trained under each splitting mode are calculated from the computed first-order and second-order gradients using the above calculation formula of the gain value.
Step S106: determine the globally optimal split node of the current round of node splitting based on the gain values of all split nodes calculated by both parties;
Since the initial sample data of both parties has undergone sample alignment, the gain values of all split nodes calculated by the two parties can be regarded as the gain values of the split nodes of the combined data sample under each splitting mode. Therefore, by comparing the gain values, the split node with the largest gain value is taken as the globally optimal split node of the current round of node splitting.
It should be noted that the sample feature corresponding to the globally optimal split node may belong either to the first data party's training sample or to the second data party's training sample.
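The selection of the globally optimal split node can be sketched as a maximum over both parties' candidate records; each candidate here is a hypothetical (provider, encoded feature, gain) triple with made-up example values:

```python
def global_best_split(candidates):
    """candidates: iterable of (site, feature_code, gain) triples gathered
    from both parties after decryption; returns the triple with maximum gain."""
    return max(candidates, key=lambda c: c[2])

# Hypothetical example: party A and party B each contribute candidates.
cands = [
    ("Site A", "EA(f1)", 0.8),
    ("Site A", "EA(f2)", 1.5),
    ("Site B", "EB(f3)", 1.2),
]
best = global_best_split(cands)  # the winning feature may come from either party
```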
Optionally, since the construction of the regression trees of the gradient boosting tree model is led by the second data party, the second data party needs to record the relevant information of the globally optimal split node determined by each round of node splitting. The relevant information includes: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
For example, if data party A holds the feature f_i corresponding to the globally optimal split node, the record is (Site A, E_A(f_i), gain). Conversely, if data party B holds the feature f_i corresponding to the globally optimal split node, the record is (Site B, E_B(f_i), gain). Here, E_A(f_i) denotes data party A's encoding of feature f_i and E_B(f_i) denotes data party B's encoding of feature f_i; the encoding allows feature f_i to be referred to without revealing its original feature data.
Optionally, when performing feature selection in the above embodiments, each globally optimal split node is preferably taken as a split node of a regression tree in the gradient boosting tree model, and the average gain value of the split nodes corresponding to the same feature code is counted.
Step S107: based on the globally optimal split node of the current round of node splitting, split the sample set corresponding to the current node to generate new nodes, so as to construct a regression tree of the gradient boosting tree model.
If the sample feature corresponding to the globally optimal split node of the current round belongs to the first data party's training sample, the sample data corresponding to the current node being split belongs to the first data party. Correspondingly, if the sample feature corresponding to the globally optimal split node of the current round belongs to the second data party's training sample, the sample data corresponding to the current node being split belongs to the second data party.
Node splitting produces new nodes (a left child node and a right child node), thereby constructing the regression tree. Through multiple rounds of node splitting, new nodes are generated continuously, yielding a regression tree of greater depth; when node splitting stops, a regression tree of the gradient boosting tree model is obtained.
In the present embodiment, since all data communicated between the two parties consists of encrypted intermediate model results, the training process does not reveal the original feature data. At the same time, an encryption algorithm is used to guarantee data privacy throughout training; preferably, a partially homomorphic encryption algorithm supporting additive homomorphism is used.
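To illustrate the additive homomorphism relied on here, the following is a toy Paillier sketch with deliberately tiny, insecure parameters; a real system would use a vetted library and large keys:

```python
import math
import random

# Toy Paillier cryptosystem illustrating the additive homomorphism used here:
# Enc(a) * Enc(b) mod n^2 decrypts to a + b. Illustration only.
p, q = 293, 433                      # toy primes; real keys use ~1024-bit primes
n, n2 = p * q, (p * q) ** 2
g = n + 1                            # standard generator choice
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lambda(n) = lcm(p-1, q-1)
mu = pow(lam, -1, n)                 # with g = n + 1, mu = lambda^{-1} mod n

def encrypt(m: int) -> int:
    r = random.randrange(1, n)       # random blinding factor with gcd(r, n) = 1
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    l = (pow(c, lam, n2) - 1) // n   # the L function, L(x) = (x - 1) / n
    return (l * mu) % n

# The first data party can aggregate encrypted gradients without decrypting:
total = (encrypt(20) * encrypt(22)) % n2
assert decrypt(total) == 42
```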
Further, in one embodiment, depending on the node splitting condition, the first-order and second-order gradients of the training samples used for node splitting are obtained as follows:
1. The current round of node splitting constructs the first regression tree
1.1 If the current round of node splitting is the first node split in constructing the first regression tree, initialize, at the second data party, the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current split;
1.2 If the current round of node splitting is not the first node split in constructing the first regression tree, reuse the first-order gradient and second-order gradient used in the first node split.
2. The current round of node splitting constructs a non-first regression tree
2.1 If the current round of node splitting is the first node split in constructing a non-first regression tree, update the first-order gradient and second-order gradient according to the previous round of federated training;
2.2 If the current round of node splitting is not the first node split in constructing a non-first regression tree, reuse the first-order gradient and second-order gradient used in the first node split.
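The gradient bookkeeping rules above can be sketched as follows (`init_gradients` and `update_gradients` are hypothetical stand-ins using squared-error loss, not taken from the patent text):

```python
def init_gradients(state):
    # stand-in for the initialization rule: squared loss at initial prediction 0
    return [-y for y in state["y"]], [1.0] * len(state["y"])

def update_gradients(state):
    # stand-in for the update rule: recompute gradients at current predictions
    return ([p - y for p, y in zip(state["pred"], state["y"])],
            [1.0] * len(state["y"]))

def gradients_for_split(state, is_first_tree, is_first_split):
    """Initialize on the first split of the first tree, update on the first
    split of later trees, and reuse the cached gradients otherwise."""
    if is_first_split:
        hook = init_gradients if is_first_tree else update_gradients
        state["g"], state["h"] = hook(state)
    # non-first splits within a tree reuse the cached gradients
    return state["g"], state["h"]

state = {"y": [1.0, 2.0], "pred": [0.5, 1.5]}
g_first, _ = gradients_for_split(state, is_first_tree=True, is_first_split=True)
g_reuse, _ = gradients_for_split(state, is_first_tree=True, is_first_split=False)
```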
Further, in one embodiment, to reduce the complexity of the regression trees, a depth threshold is preset for the regression trees to limit node splitting.
In the present embodiment, each time a new node is generated to construct a regression tree of the gradient boosting tree model, the second data party judges whether the depth of the current regression tree reaches the preset depth threshold.
If the depth of the current regression tree reaches the preset depth threshold, node splitting stops and one regression tree of the gradient boosting tree model is obtained; otherwise, the next round of node splitting continues.
It should be noted that the condition limiting node splitting may also be to stop splitting when a node cannot be split further, for example, when the samples corresponding to the current node can no longer be divided, node splitting cannot continue.
Further, in another embodiment, to avoid overfitting during training, a quantity threshold is preset for the regression trees to limit the number of regression trees generated.
In the present embodiment, when node splitting stops, the second data party judges whether the total number of regression trees reaches the preset quantity threshold.
If the total number of regression trees reaches the preset quantity threshold, federated training stops; otherwise, the next round of federated training continues.
It should be noted that the condition limiting the number of regression trees generated may also be to stop constructing regression trees when nodes can no longer be split.
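The two stopping rules (depth threshold per tree, quantity threshold per model) can be sketched as an outer control loop; `grow_one_level` is a hypothetical callback performing one round of node splitting and returning False when no node can be split further:

```python
def train_federated_gbdt(depth_threshold, quantity_threshold, grow_one_level):
    """Outer control loop combining the two stopping rules described above.
    Returns the number of regression trees generated."""
    trees = 0
    while trees < quantity_threshold:          # quantity-threshold rule
        depth = 0
        while depth < depth_threshold:         # depth-threshold rule
            if not grow_one_level():           # no splittable node left
                break
            depth += 1
        trees += 1
    return trees
```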
For a better understanding of the present invention, the federated training and modeling process of the present invention is illustrated below based on the sample data of Tables 1 and 2 in the above embodiments.
First round of federated training: training the first regression tree
(1) First round of node splitting
1.1 At the second data party, calculate the first-order gradients (g_i) and second-order gradients (h_i) of the sample data in Table 2; encrypt g_i and h_i and send them to the first data party;
1.2 At the first data party, based on g_i and h_i, calculate the gain values gain of the split nodes of the sample data in Table 1 under all possible splitting modes; send the gain values to the second data party;
Since in Table 1 the Age feature has 5 sample data splitting modes, the Gender feature has 2 sample data splitting modes, and the Amount of given credit feature has 5 sample data splitting modes, the sample data in Table 1 has 12 splitting modes in total, that is, the gain values of the split nodes corresponding to 12 splitting modes need to be calculated.
1.3 At the second data party, calculate the gain values gain of the split nodes of the sample data in Table 2 under all possible splitting modes;
Since in Table 2 the Bill Payment feature has 5 sample data splitting modes and the Education feature has 3 sample data splitting modes, the sample data in Table 2 has 8 splitting modes in total, that is, the gain values of the split nodes corresponding to 8 splitting modes need to be calculated.
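One common convention for enumerating splitting modes, one candidate threshold per distinct feature value, reproduces the count for Table 2 (a sketch; the patent does not fix the counting convention, and its count for Table 1 may follow a different one):

```python
def candidate_splits(column):
    """One candidate threshold per distinct value in a feature column."""
    return sorted(set(column))

# Table 2 columns:
bill_payment = [3102, 17250, 14027, 6787, 280]
education = [2, 3, 2, 1, 1]
n_modes = len(candidate_splits(bill_payment)) + len(candidate_splits(education))
# 5 + 3 = 8 splitting modes for Table 2, matching the count above
```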
1.4 From the gain values of the split nodes corresponding to the 12 splitting modes calculated at the first data party and the gain values of the split nodes corresponding to the 8 splitting modes calculated at the second data party, select the feature corresponding to the maximum gain value as the globally optimal split node of the current round of node splitting;
1.5 Based on the globally optimal split node of the current round of node splitting, split the sample data corresponding to the current node to generate new nodes, so as to construct a regression tree of the gradient boosting tree model.
1.6 Judge whether the depth of the current regression tree reaches the preset depth threshold; if it does, stop node splitting, obtaining a regression tree of the gradient boosting tree model, otherwise continue with the next round of node splitting;
1.7 Judge whether the total number of regression trees reaches the preset quantity threshold; if it does, stop federated training, otherwise continue with the next round of federated training.
(2) Second and third rounds of node splitting
2.1 Suppose the feature corresponding to the previous round's split node is Bill Payment less than or equal to 3102; this feature serves as the split node (corresponding samples X1, X2, X3, X4, X5) and generates two new nodes, where the left node corresponds to the sample set less than or equal to 3102 (X1, X5) and the right node corresponds to the sample set greater than 3102 (X2, X3, X4). The sample sets (X1, X5) and (X2, X3, X4) are then used as new sample sets for the second and third rounds of node splitting respectively, so as to split the two new nodes and generate further new nodes.
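The sample-set split described in step 2.1 can be sketched directly (an illustrative helper, not from the patent text):

```python
def partition(samples, feature, threshold):
    """Split a sample set on `feature <= threshold`.
    samples: dict mapping sample id -> feature dict.
    Returns (left_ids, right_ids)."""
    left = [i for i, f in samples.items() if f[feature] <= threshold]
    right = [i for i, f in samples.items() if f[feature] > threshold]
    return left, right

# Table 2 example: Bill Payment <= 3102 splits {X1..X5} into (X1, X5) and (X2, X3, X4).
samples = {"X1": {"Bill Payment": 3102}, "X2": {"Bill Payment": 17250},
           "X3": {"Bill Payment": 14027}, "X4": {"Bill Payment": 6787},
           "X5": {"Bill Payment": 280}}
left, right = partition(samples, "Bill Payment", 3102)
```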
2.2 Since the second and third rounds of node splitting belong to the same round of federated training, the sample gradient values used in the first round of node splitting continue to be reused. Suppose the feature corresponding to one split node of the current round is Amount of given credit less than or equal to 200; this feature serves as the split node (corresponding samples X1, X5) and generates two new nodes, where the left node corresponds to sample X5, which is less than or equal to 200, and the right node corresponds to sample X1, which is greater than 200. Similarly, suppose the feature corresponding to the other split node of the current round is Age less than or equal to 35; this feature serves as the split node (corresponding samples X2, X3, X4) and generates two new nodes, where the left node corresponds to samples X2 and X3, which are less than or equal to 35, and the right node corresponds to sample X4, which is greater than 35. The specific implementation flow refers to the first-round node splitting process.
Second round of federated training: training the second regression tree
3.1 Since the current round of node splitting belongs to the next round of federated training, the first-order and second-order gradients used in the previous round of federated training are updated with the previous round's training result, and the second round of federated training proceeds with node splitting to generate new nodes and construct the next regression tree. The specific implementation flow refers to the construction process of the previous regression tree.
3.2 As shown in Fig. 4, after two rounds of federated training, the sample data of Tables 1 and 2 in the above embodiments produces two regression trees. The first regression tree contains three split nodes: Bill Payment less than or equal to 3102, Amount of given credit less than or equal to 200, and Age less than or equal to 35. The second regression tree contains two split nodes: Bill Payment less than or equal to 6787, and Gender == 1.
3.3 Based on the two regression trees of the gradient boosting tree model shown in Fig. 4, the average gain values corresponding to the features of the sample data are: Bill Payment is (gain1 + gain4)/2; Education is 0; Age is gain3; Gender is gain5; Amount of given credit is gain2.
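The scoring step, averaging gains per feature and assigning the default score to features that never appear as a split node (Education here), can be sketched as follows (the gain values are hypothetical):

```python
from collections import defaultdict

def feature_scores(split_records, all_features, default_score=0.0):
    """Average the gain values of split nodes sharing the same feature;
    features that never appear as a split node get the default score.
    split_records: iterable of (feature, gain) pairs. Returns a ranking."""
    sums, counts = defaultdict(float), defaultdict(int)
    for feature, gain in split_records:
        sums[feature] += gain
        counts[feature] += 1
    scores = {f: (sums[f] / counts[f] if counts[f] else default_score)
              for f in all_features}
    # Rank features by score, highest first, for feature selection.
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical gain values for the five split nodes in the walkthrough:
records = [("Bill Payment", 0.9), ("Amount of given credit", 0.8),
           ("Age", 0.6), ("Bill Payment", 0.5), ("Gender", 0.3)]
features = ["Age", "Gender", "Amount of given credit",
            "Bill Payment", "Education"]
ranking = feature_scores(records, features)
# Bill Payment averages (0.9 + 0.5) / 2; Education gets the default 0.0
```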
The present invention also provides a computer-readable storage medium.
A feature selection program is stored on the computer-readable storage medium of the present invention; when executed by a processor, the feature selection program implements the steps of the feature selection method based on federated training described in any of the above embodiments.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, or the part thereof that contributes to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM) and including instructions that cause a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the invention is not limited to the above specific embodiments, which are merely illustrative rather than restrictive. Under the inspiration of the present invention, those skilled in the art can make many further forms without departing from the scope protected by the purpose of the present invention and the claims; all equivalent structures or equivalent process transformations made using the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, fall within the protection of the present invention.
Claims (10)
1. A feature selection method based on federated training, characterized in that the feature selection method based on federated training comprises the following steps:
performing federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises a plurality of regression trees, and one split node of a regression tree corresponds to one feature of a training sample;
counting the average gain value of the split nodes corresponding to the same feature in the gradient boosting tree model, and taking the average gain value as the score of the corresponding feature;
ranking the features based on the score of each feature and outputting the ranking result for feature selection, wherein if a feature of the training samples corresponds to no split node, that feature uses a default score.
2. The feature selection method based on federated training according to claim 1, characterized in that the two aligned training samples are a first training sample and a second training sample respectively;
the attributes of the first training sample include a sample ID and part of the sample features, and the attributes of the second training sample include the sample ID, the remaining sample features, and a data label;
the first training sample is provided by a first data party and stored locally at the first data party, and the second training sample is provided by a second data party and stored locally at the second data party.
3. The feature selection method based on federated training according to claim 2, characterized in that the performing federated training on the two aligned training samples using the XGBoost algorithm to construct the gradient boosting tree model comprises:
at the second data party, obtaining the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
if the current round of node splitting is the first node split in constructing a regression tree, encrypting the first-order gradient and the second-order gradient and sending them, together with the sample IDs of the sample set, to the first data party, so that the first data party calculates, based on the encrypted first-order gradient and second-order gradient, the gain values of its local training samples corresponding to the sample IDs under each splitting mode;
if the current round of node splitting is not the first node split in constructing the regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party, reusing the first-order gradient and second-order gradient used in the first node split, calculates the gain values of its local training samples corresponding to the sample IDs under each splitting mode;
the second data party receiving and decrypting the encrypted gain values of all split nodes returned by the first data party;
at the second data party, based on the first-order gradient and the second-order gradient, calculating the gain values of the local training samples corresponding to the sample IDs under each splitting mode;
determining the globally optimal split node of the current round of node splitting based on the gain values of all split nodes calculated by both parties;
based on the globally optimal split node of the current round of node splitting, splitting the sample set corresponding to the current node to generate new nodes, so as to construct a regression tree of the gradient boosting tree model.
4. The feature selection method based on federated training according to claim 3, characterized in that before the step of obtaining, at the second data party, the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting, the method further comprises:
when performing node splitting, judging whether the current round of node splitting corresponds to constructing the first regression tree;
if the current round of node splitting corresponds to constructing the first regression tree, judging whether it is the first node split in constructing the first regression tree;
if the current round of node splitting is the first node split in constructing the first regression tree, initializing, at the second data party, the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting; if the current round of node splitting is not the first node split in constructing the first regression tree, reusing the first-order gradient and second-order gradient used in the first node split;
if the current round of node splitting corresponds to constructing a non-first regression tree, judging whether it is the first node split in constructing the non-first regression tree;
if the current round of node splitting is the first node split in constructing the non-first regression tree, updating the first-order gradient and second-order gradient according to the previous round of federated training; if the current round of node splitting is not the first node split in constructing the non-first regression tree, reusing the first-order gradient and second-order gradient used in the first node split.
5. The feature selection method based on federated training according to claim 3, characterized in that the feature selection method based on federated training further comprises:
when generating new nodes to construct a regression tree of the gradient boosting tree model, judging, at the second data party, whether the depth of the current regression tree reaches a preset depth threshold;
if the depth of the current regression tree reaches the preset depth threshold, stopping node splitting to obtain a regression tree of the gradient boosting tree model, otherwise continuing with the next round of node splitting.
6. The feature selection method based on federated training according to claim 5, characterized in that the feature selection method based on federated training further comprises:
when node splitting stops, judging, at the second data party, whether the total number of regression trees reaches a preset quantity threshold;
if the total number of regression trees reaches the preset quantity threshold, stopping federated training, otherwise continuing with the next round of federated training.
7. The feature selection method based on federated training according to any one of claims 3-6, characterized in that the feature selection method based on federated training further comprises:
at the second data party, recording the relevant information of the globally optimal split node determined by each round of node splitting;
wherein the relevant information includes: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
8. The feature selection method based on federated training according to claim 7, characterized in that the counting of the average gain value of the split nodes corresponding to the same feature in the gradient boosting tree model comprises:
at the second data party, taking each globally optimal split node as a split node of a regression tree in the gradient boosting tree model, and counting the average gain value of the split nodes corresponding to the same feature code.
9. A feature selection apparatus based on federated training, characterized in that the feature selection apparatus based on federated training comprises a memory, a processor, and a feature selection program stored on the memory and executable on the processor, wherein the feature selection program, when executed by the processor, implements the steps of the feature selection method based on federated training according to any one of claims 1-8.
10. A computer-readable storage medium, characterized in that a feature selection program is stored on the computer-readable storage medium, wherein the feature selection program, when executed by a processor, implements the steps of the feature selection method based on federated training according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810918867.3A CN109034398B (en) | 2018-08-10 | 2018-08-10 | Gradient lifting tree model construction method and device based on federal training and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109034398A true CN109034398A (en) | 2018-12-18 |
CN109034398B CN109034398B (en) | 2023-09-12 |
Family
ID=64633061
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810918867.3A Active CN109034398B (en) | 2018-08-10 | 2018-08-10 | Gradient lifting tree model construction method and device based on federal training and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109034398B (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109492420A (en) * | 2018-12-28 | 2019-03-19 | 深圳前海微众银行股份有限公司 | Model parameter training method, terminal, system and medium based on federation's study |
CN109711556A (en) * | 2018-12-24 | 2019-05-03 | 中国南方电网有限责任公司 | Machine patrols data processing method, device, net grade server and provincial server |
CN109934179A (en) * | 2019-03-18 | 2019-06-25 | 中南大学 | Human motion recognition method based on automated characterization selection and Ensemble Learning Algorithms |
CN110297848A (en) * | 2019-07-09 | 2019-10-01 | 深圳前海微众银行股份有限公司 | Recommended models training method, terminal and storage medium based on federation's study |
CN110851786A (en) * | 2019-11-14 | 2020-02-28 | 深圳前海微众银行股份有限公司 | Longitudinal federated learning optimization method, device, equipment and storage medium |
CN110941963A (en) * | 2019-11-29 | 2020-03-31 | 福州大学 | Text attribute viewpoint abstract generation method and system based on sentence emotion attributes |
CN110968886A (en) * | 2019-12-20 | 2020-04-07 | 支付宝(杭州)信息技术有限公司 | Method and system for screening training samples of machine learning model |
CN110990829A (en) * | 2019-11-21 | 2020-04-10 | 支付宝(杭州)信息技术有限公司 | Method, device and equipment for training GBDT model in trusted execution environment |
CN111079939A (en) * | 2019-11-28 | 2020-04-28 | 支付宝(杭州)信息技术有限公司 | Machine learning model feature screening method and device based on data privacy protection |
CN111178538A (en) * | 2019-12-17 | 2020-05-19 | 杭州睿信数据科技有限公司 | Federated learning method and device for vertical data |
CN111178408A (en) * | 2019-12-19 | 2020-05-19 | 中国科学院计算技术研究所 | Health monitoring model construction method and system based on federated random forest learning |
CN111291417A (en) * | 2020-05-09 | 2020-06-16 | 支付宝(杭州)信息技术有限公司 | Method and device for protecting data privacy of multi-party combined training object recommendation model |
CN111340614A (en) * | 2020-02-28 | 2020-06-26 | 深圳前海微众银行股份有限公司 | Sample sampling method and device based on federated learning and readable storage medium |
CN111368901A (en) * | 2020-02-28 | 2020-07-03 | 深圳前海微众银行股份有限公司 | Multi-party joint modeling method, device and medium based on federated learning |
CN111507479A (en) * | 2020-04-15 | 2020-08-07 | 深圳前海微众银行股份有限公司 | Feature binning method, device, equipment and computer-readable storage medium |
CN111738359A (en) * | 2020-07-24 | 2020-10-02 | 支付宝(杭州)信息技术有限公司 | Two-party decision tree training method and system |
WO2021000572A1 (en) * | 2019-07-01 | 2021-01-07 | 创新先进技术有限公司 | Data processing method and apparatus, and electronic device |
WO2021082634A1 (en) * | 2019-10-29 | 2021-05-06 | 支付宝(杭州)信息技术有限公司 | Tree model-based prediction method and apparatus |
CN113435537A (en) * | 2021-07-16 | 2021-09-24 | 同盾控股有限公司 | Cross-feature federated learning method and prediction method based on Soft GBDT |
CN113657617A (en) * | 2020-04-23 | 2021-11-16 | 支付宝(杭州)信息技术有限公司 | Method and system for model joint training |
CN113722987A (en) * | 2021-08-16 | 2021-11-30 | 京东科技控股股份有限公司 | Federated learning model training method and device, electronic device and storage medium |
CN113723477A (en) * | 2021-08-16 | 2021-11-30 | 同盾科技有限公司 | Cross-feature federated abnormal data detection method based on isolation forest |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107704966A (en) * | 2017-10-17 | 2018-02-16 | 华南理工大学 | Energy load forecasting system and method based on weather big data |
CN107767183A (en) * | 2017-10-31 | 2018-03-06 | 常州大学 | Brand loyalty testing method based on combination learning and profile points |
US20180089587A1 (en) * | 2016-09-26 | 2018-03-29 | Google Inc. | Systems and Methods for Communication Efficient Distributed Mean Estimation |
CN107993139A (en) * | 2017-11-15 | 2018-05-04 | 华融融通(北京)科技有限公司 | Consumer finance anti-fraud system and method based on a dynamically adjusted database |
CN108021984A (en) * | 2016-11-01 | 2018-05-11 | 第四范式(北京)技术有限公司 | Method and system for determining the feature importance of machine learning samples |
CN108257105A (en) * | 2018-01-29 | 2018-07-06 | 南华大学 | Deep network model for joint learning of optical flow estimation and denoising for video images |
CN108375808A (en) * | 2018-03-12 | 2018-08-07 | 南京恩瑞特实业有限公司 | NRIET dense fog forecasting method based on machine learning |
- 2018-08-10: CN application CN201810918867.3A filed, granted as patent CN109034398B (status: Active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180089587A1 (en) * | 2016-09-26 | 2018-03-29 | Google Inc. | Systems and Methods for Communication Efficient Distributed Mean Estimation |
CN108021984A (en) * | 2016-11-01 | 2018-05-11 | 第四范式(北京)技术有限公司 | Method and system for determining the feature importance of machine learning samples |
CN107704966A (en) * | 2017-10-17 | 2018-02-16 | 华南理工大学 | Energy load forecasting system and method based on weather big data |
CN107767183A (en) * | 2017-10-31 | 2018-03-06 | 常州大学 | Brand loyalty testing method based on combination learning and profile points |
CN107993139A (en) * | 2017-11-15 | 2018-05-04 | 华融融通(北京)科技有限公司 | Consumer finance anti-fraud system and method based on a dynamically adjusted database |
CN108257105A (en) * | 2018-01-29 | 2018-07-06 | 南华大学 | Deep network model for joint learning of optical flow estimation and denoising for video images |
CN108375808A (en) * | 2018-03-12 | 2018-08-07 | 南京恩瑞特实业有限公司 | NRIET dense fog forecasting method based on machine learning |
Non-Patent Citations (5)
Title |
---|
H. Brendan McMahan et al.: "Communication-efficient learning of deep networks from decentralized data", Artificial Intelligence and Statistics * |
Jakub et al.: "Federated learning strategies for improving communication efficiency", arXiv.org * |
Stephen Hardy et al.: "Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption", arXiv.org * |
Tianqi Chen et al.: "XGBoost: A Scalable Tree Boosting System", KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining * |
许裕栗 et al.: "Application of the XGBoost algorithm in regional electricity consumption forecasting" (Xgboost算法在区域用电预测中的应用), Process Automation Instrumentation (自动化仪表) * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109711556A (en) * | 2018-12-24 | 2019-05-03 | 中国南方电网有限责任公司 | Machine-inspection data processing method and device, grid-level server and provincial-level server |
US11947680B2 (en) | 2018-12-28 | 2024-04-02 | Webank Co., Ltd | Model parameter training method, terminal, and system based on federation learning, and medium |
CN109492420B (en) * | 2018-12-28 | 2021-07-20 | 深圳前海微众银行股份有限公司 | Model parameter training method, terminal, system and medium based on federated learning |
CN109492420A (en) * | 2018-12-28 | 2019-03-19 | 深圳前海微众银行股份有限公司 | Model parameter training method, terminal, system and medium based on federated learning |
WO2020134704A1 (en) * | 2018-12-28 | 2020-07-02 | 深圳前海微众银行股份有限公司 | Model parameter training method based on federated learning, terminal, system and medium |
CN109934179A (en) * | 2019-03-18 | 2019-06-25 | 中南大学 | Human motion recognition method based on automatic feature selection and ensemble learning algorithms |
WO2021000572A1 (en) * | 2019-07-01 | 2021-01-07 | 创新先进技术有限公司 | Data processing method and apparatus, and electronic device |
CN110297848A (en) * | 2019-07-09 | 2019-10-01 | 深圳前海微众银行股份有限公司 | Recommendation model training method, terminal and storage medium based on federated learning |
CN110297848B (en) * | 2019-07-09 | 2024-02-23 | 深圳前海微众银行股份有限公司 | Recommendation model training method, terminal and storage medium based on federated learning |
WO2021082634A1 (en) * | 2019-10-29 | 2021-05-06 | 支付宝(杭州)信息技术有限公司 | Tree model-based prediction method and apparatus |
CN110851786A (en) * | 2019-11-14 | 2020-02-28 | 深圳前海微众银行股份有限公司 | Longitudinal federated learning optimization method, device, equipment and storage medium |
CN110990829A (en) * | 2019-11-21 | 2020-04-10 | 支付宝(杭州)信息技术有限公司 | Method, device and equipment for training GBDT model in trusted execution environment |
CN111079939A (en) * | 2019-11-28 | 2020-04-28 | 支付宝(杭州)信息技术有限公司 | Machine learning model feature screening method and device based on data privacy protection |
CN110941963A (en) * | 2019-11-29 | 2020-03-31 | 福州大学 | Text attribute opinion summary generation method and system based on sentence sentiment attributes |
CN111178538A (en) * | 2019-12-17 | 2020-05-19 | 杭州睿信数据科技有限公司 | Federated learning method and device for vertical data |
CN111178538B (en) * | 2019-12-17 | 2023-08-15 | 杭州睿信数据科技有限公司 | Federated learning method and device for vertical data |
CN111178408A (en) * | 2019-12-19 | 2020-05-19 | 中国科学院计算技术研究所 | Health monitoring model construction method and system based on federated random forest learning |
CN110968886A (en) * | 2019-12-20 | 2020-04-07 | 支付宝(杭州)信息技术有限公司 | Method and system for screening training samples of machine learning model |
CN111368901A (en) * | 2020-02-28 | 2020-07-03 | 深圳前海微众银行股份有限公司 | Multi-party joint modeling method, device and medium based on federated learning |
CN111340614A (en) * | 2020-02-28 | 2020-06-26 | 深圳前海微众银行股份有限公司 | Sample sampling method and device based on federated learning and readable storage medium |
CN111507479B (en) * | 2020-04-15 | 2021-08-10 | 深圳前海微众银行股份有限公司 | Feature binning method, device, equipment and computer-readable storage medium |
CN111507479A (en) * | 2020-04-15 | 2020-08-07 | 深圳前海微众银行股份有限公司 | Feature binning method, device, equipment and computer-readable storage medium |
CN113657617A (en) * | 2020-04-23 | 2021-11-16 | 支付宝(杭州)信息技术有限公司 | Method and system for model joint training |
CN111291417A (en) * | 2020-05-09 | 2020-06-16 | 支付宝(杭州)信息技术有限公司 | Method and device for protecting data privacy of multi-party combined training object recommendation model |
CN111738359A (en) * | 2020-07-24 | 2020-10-02 | 支付宝(杭州)信息技术有限公司 | Two-party decision tree training method and system |
CN113435537A (en) * | 2021-07-16 | 2021-09-24 | 同盾控股有限公司 | Cross-feature federated learning method and prediction method based on Soft GBDT |
CN113722987A (en) * | 2021-08-16 | 2021-11-30 | 京东科技控股股份有限公司 | Federated learning model training method and device, electronic device and storage medium |
CN113723477A (en) * | 2021-08-16 | 2021-11-30 | 同盾科技有限公司 | Cross-feature federated abnormal data detection method based on isolation forest |
CN113722987B (en) * | 2021-08-16 | 2023-11-03 | 京东科技控股股份有限公司 | Training method and device for a federated learning model, electronic device and storage medium |
CN113723477B (en) * | 2021-08-16 | 2024-04-30 | 同盾科技有限公司 | Cross-feature federated abnormal data detection method based on isolation forest |
Also Published As
Publication number | Publication date |
---|---|
CN109034398B (en) | 2023-09-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109034398A (en) | Feature selection method, device and storage medium based on federated training | |
CN109165683A (en) | Sample prediction method, device and storage medium based on federated training | |
CN109299811B (en) | Complex network-based fraud group recognition and risk propagation prediction method | |
WO2022110721A1 (en) | Client category aggregation-based joint risk assessment method and related device | |
CN108763314A (en) | Interest recommendation method, device, server and storage medium | |
CN111932386B (en) | User account determining method and device, information pushing method and device, and electronic equipment | |
US9838484B2 (en) | Relevance estimation and actions based thereon | |
CN109753608A (en) | Method for determining user tags, and autoencoder network training method and device | |
CN107291815A (en) | Recommendation method for question-and-answer communities based on cross-platform tag fusion | |
CN112416986B (en) | User profile realization method and system based on hierarchical personalized federated learning | |
CN107633257B (en) | Data quality evaluation method and device, computer readable storage medium and terminal | |
CN107003834B (en) | Pedestrian detection device and method | |
CN108446291A (en) | Real-time scoring method and scoring system for user credit | |
CN107741986A (en) | User behavior prediction and corresponding information recommendation method and device | |
Postigo-Boix et al. | A social model based on customers’ profiles for analyzing the churning process in the mobile market of data plans | |
CN111767319A (en) | Customer mining method and device based on fund flow direction | |
CN112101577A (en) | XGBoost-based cross-sample federated learning and testing method, system, device and medium | |
CN108876193A (en) | Risk control model building method based on credit scoring | |
CN103366009A (en) | Book recommendation method based on adaptive clustering | |
CN107368499B (en) | Client label modeling and recommending method and device | |
CN112817563A (en) | Target attribute configuration information determination method, computer device, and storage medium | |
CN111984842B (en) | Bank customer data processing method and device | |
CN112837078B (en) | Method for detecting abnormal behavior of user based on clusters | |
CN106056137A (en) | Telecom group service recommendation method based on a data mining multi-classification algorithm | |
CN106383738A (en) | Task processing method and distributed computing framework |
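The abstract describes scoring each feature by the average gain of its split nodes across the boosted trees, with a default score for any feature that never appears as a split node. A minimal single-party sketch of that scoring step follows; the `score_features` helper, the tree representation, and the default value of 0.0 are illustrative assumptions, and the federated (encrypted, multi-party) training itself is not shown:

```python
from collections import defaultdict

def score_features(trees, all_features, default_score=0.0):
    """Score features by the average split-node gain across a boosted ensemble.

    `trees` is a list of regression trees, each given as a list of
    (feature, gain) pairs, one per split node. Features that never split
    receive `default_score`. Returns features sorted by score, descending.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tree in trees:
        for feature, gain in tree:
            totals[feature] += gain
            counts[feature] += 1
    scores = {}
    for feature in all_features:
        if counts[feature]:
            # Average gain over all split nodes that used this feature.
            scores[feature] = totals[feature] / counts[feature]
        else:
            # Feature never chosen as a split node: assign the default score.
            scores[feature] = default_score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Two toy trees: "income" splits twice with gains 5.0 and 7.0 (average 6.0),
# "age" once with gain 3.0, and "city" never splits.
trees = [[("age", 3.0), ("income", 5.0)], [("income", 7.0)]]
ranking = score_features(trees, ["age", "income", "city"])
```

In the real setting the trees come from XGBoost-style federated training over aligned samples; this sketch only illustrates the scoring and ranking performed once the ensemble exists.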
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||