CN107423815A

CN107423815A - A computer-based method for cleaning low-quality classified image data

Info

Publication number: CN107423815A
Application number: CN201710665692.5A
Authority: CN
Inventors: 李玉鑑; 余华擎
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2017-08-07
Filing date: 2017-08-07
Publication date: 2017-12-01
Anticipated expiration: 2037-08-07
Also published as: CN107423815B

Abstract

The invention discloses a computer-based method for cleaning low-quality classified image data. It effectively cleans low-quality classified image data collected in bulk from the internet, yielding higher-quality image data for training a classification model with a higher recognition rate. The procedure is as follows: first, a preliminary convolutional neural network is trained directly on the low-quality classified image data; the network is then used to recognize that same data; images whose pseudo-probability of being recognized as their own class falls below a threshold are removed, as are image classes whose image count falls below a threshold; the process is repeated until the recognition rate over all image classes reaches a preset standard. Comparative experiments show that the invention effectively improves the classification quality and the recognition level of image data.

Description

A computer-based method for cleaning low-quality classified image data

Technical field

A method for cleaning low-quality classified image data based on convolutional neural networks. The method effectively cleans low-quality classified image data collected in bulk from the internet, yielding higher-quality image data for training a classification model with a higher recognition rate. It belongs to the technical field of artificial neural networks.

Background technology

Artificial neural networks (ANNs) have been a research focus in artificial intelligence since the 1980s. An ANN abstracts the neuron network of the human brain from an information-processing perspective, builds a simple model of it, and forms different networks through different connection patterns. In engineering and academia it is often called simply a neural network or neural-like network. A neural network is a computational model made up of a large number of interconnected nodes (neurons). Each node represents a specific output function, called the activation function. Each connection between two nodes carries a weighting value for the signal passing through it, called the weight; the weights serve as the memory of the network. The output of the network depends on its connection pattern, its weights, and its activation functions. The network itself usually approximates some algorithm or function found in nature, or expresses a logical strategy.

A convolutional neural network (CNN) is a feedforward neural network and a type of artificial neural network. Its artificial neurons respond to surrounding units within a limited receptive field, and it performs very well on large-scale image processing. Because its distinctive network structure effectively reduces the complexity of feedback neural networks, the CNN has become a research focus in speech analysis and image recognition. Its weight-sharing structure makes it more similar to a biological neural network, reduces the complexity of the network model, and reduces the number of weights. This advantage is most apparent when the network input is a multidimensional image: the image can be fed to the network directly, avoiding the complex feature extraction and data reconstruction of traditional recognition algorithms. Compared with traditional neural networks, convolutional neural networks have the following features:

1. Sparse connectivity

A convolutional network exploits the local spatial structure of images by enforcing a local connection pattern between adjacent layers: a hidden unit in layer m is connected only to a local region of the input units in layer m-1. These local regions of layer m-1 are called spatially contiguous receptive fields.

2. Shared weights

In a convolutional neural network, each sparse filter covers the entire visual field by sharing its weights. The units that share weights form a feature map; together with sparse connectivity, this constitutes the feature extraction layer, namely the convolutional layer.

3. Pooling layer

The pooling layer is another building block of convolutional neural networks. It progressively reduces the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network. The pooling layer operates independently on each feature map.

In addition, convolutional neural networks include elements of traditional neural networks, such as fully connected layers and the common nonlinear activation functions sigmoid, tanh, and ReLU.
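
The three features above can be illustrated with a small NumPy sketch. It is not taken from the patent: the function names `conv2d_valid` and `max_pool2d`, the 5x5 input, and the 2x2 all-ones kernel are assumptions chosen only to show one shared kernel being applied to local patches, followed by non-overlapping max pooling.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Sparse connectivity: each output unit sees only a kh x kw patch.
            # Shared weights: the same kernel is reused at every position.
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
kernel = np.ones((2, 2))                          # 4 weights shared by all 16 output units
feature_map = conv2d_valid(image, kernel)         # shape (4, 4)
pooled = max_pool2d(feature_map)                  # shape (2, 2)
```

In this toy case the convolutional layer has 4 weights in total, whereas a fully connected layer mapping 25 inputs to 16 outputs would need 400.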

Now that convolutional neural networks have achieved great success, a good data set is the key to training a good convolutional neural network model. Common data sets include PASCAL VOC, MNIST, ImageNet, and CIFAR-10. ImageNet contains about 15M labeled high-resolution images in about 22K categories; the images were collected from the web and labeled by hand, and the data set is commonly used to benchmark the classification performance of convolutional neural network models.

The data sets described above are general-purpose and professionally curated, and have undergone extensive inspection and manual labeling. For ordinary applications, however, the images available for a given class may come from a web crawler and inevitably contain a great deal of noise. How to clean higher-quality data out of such data, together with a way to evaluate the result, is the focus of the present invention. Once higher-quality data has been extracted from the noisy data, it can be used to train a convolutional neural network for practical applications.

Summary of the invention

The technical solution adopted by the invention is a method for cleaning low-quality classified image data based on convolutional neural networks, comprising the following steps:

a) Download labeled image data in bulk from the internet and organize it into an image data set DataSet0 with M classes in total, where the i-th class contains N_i images, i = 1, 2, 3, ..., M;

b) Train a convolutional neural network CNN0 on DataSet0, as follows:

i. Build a convolutional neural network model and keep its structure fixed;

ii. Randomly take a certain proportion (e.g. 80% or 90%) of DataSet0 as the training set of the convolutional neural network;

iii. Use the remainder of DataSet0 as the test set of the convolutional neural network;

iv. Train CNN0 for a predetermined number of iterations and record the network's test recognition rate as Acc0;
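
The random split in step b) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper name `split_dataset`, the fixed seed, and the toy file names are assumptions.

```python
import random

def split_dataset(samples, train_fraction=0.9, seed=0):
    """Randomly split a list of samples into a training list and a test list."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility (assumption)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

# Toy stand-in for DataSet0: (file name, class label) pairs.
samples = [("img_%03d.jpg" % n, n % 5) for n in range(100)]
train_set, test_set = split_dataset(samples, train_fraction=0.9)
```

Because the split is purely random, the test set drawn from DataSet0 contains the same kind of label noise as the training set, which is worth keeping in mind when reading Acc0.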

c) For the i-th class of images in DataSet0, construct a one-dimensional image self-recognition array K_i of length N_i, as follows:

i. Use CNN0 to recognize the images of DataSet0. Denote by p_ijk the pseudo-probability that the j-th image of the i-th class is recognized as class k, k = 1, 2, 3, ..., M, and sort these pseudo-probabilities in descending order;

ii. If k = i occurs among the first L (e.g. L = 10) sorted pseudo-probabilities, set the self-recognition rate K_ij = p_ijk (with k = i); otherwise set K_ij = 0;
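
Step c) can be sketched in NumPy as below. The function name `self_recognition_rates`, the zero-based class index, and the toy probability matrix are assumptions; the network output is taken to be a matrix whose row j holds the pseudo-probabilities p_ijk for one image of class i.

```python
import numpy as np

def self_recognition_rates(probs, true_class, top_l=10):
    """Return K_ij for every image of one class.

    probs: array of shape (N_i, M); row j holds p_ijk for k = 0..M-1.
    true_class: zero-based index i of the labelled class.
    """
    probs = np.asarray(probs, dtype=float)
    top_l = min(top_l, probs.shape[1])
    # Indices of the top_l largest pseudo-probabilities for each image.
    top_idx = np.argsort(-probs, axis=1, kind="stable")[:, :top_l]
    in_top = (top_idx == true_class).any(axis=1)
    # K_ij is the pseudo-probability of the labelled class if it is in the top L, else 0.
    return np.where(in_top, probs[:, true_class], 0.0)

probs = np.array([
    [0.70, 0.20, 0.10],   # labelled class ranked first
    [0.10, 0.60, 0.30],   # labelled class ranked last, outside the top 2
    [0.30, 0.25, 0.45],   # labelled class ranked second
])
K = self_recognition_rates(probs, true_class=0, top_l=2)
```

With L = 2 in this toy case, the second image gets K = 0 because class 0 is not among its two highest pseudo-probabilities.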

d) Analyze the self-recognition array K and clean the low-quality part of the i-th class of image data:

i. Compute the mean μ of the self-recognition rates of the i-th class of images;

ii. Compute the standard deviation σ of the self-recognition rates of the i-th class of images;

iii. Compute the cut-off value for "low recognition rate" images of the i-th class, SepVal = μ - σ * α, where α is an integer with 1 ≤ α ≤ 10 and SepVal > 0;

iv. In the i-th class, remove the j-th image if K_ij < SepVal. The data set obtained after cleaning is DataSet1;
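
The thresholding in step d) can be sketched as follows. The formulas for the mean and standard deviation did not survive in this text, so the sketch assumes μ is the arithmetic mean of K_ij over the N_i images of the class and σ is the population standard deviation (dividing by N_i rather than N_i - 1); the function name `clean_class` and the toy values are likewise assumptions.

```python
import numpy as np

def clean_class(K, alpha=1):
    """Return a boolean keep-mask for one class and the statistics (mu, sigma, sep_val)."""
    K = np.asarray(K, dtype=float)
    mu = K.mean()
    sigma = K.std()                  # population standard deviation (assumption)
    sep_val = mu - alpha * sigma     # SepVal = mu - sigma * alpha
    # Images with K_ij < SepVal are removed. If sep_val <= 0 nothing is removed,
    # since K_ij >= 0; the source requires alpha to be chosen so that SepVal > 0.
    keep = K >= sep_val
    return keep, (mu, sigma, sep_val)

K_example = [0.9, 0.8, 0.85, 0.0, 0.95]   # toy self-recognition rates of one class
keep, (mu, sigma, sep_val) = clean_class(K_example, alpha=1)
```

Here the fourth image, which the network did not place in its top L at all, falls below the cut-off and is the only one removed.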

e) Train a convolutional neural network on DataSet1 in the same way, obtain the test recognition rate Acc1, record it, and compare it with Acc0 to confirm whether the cleaning was effective;

f) In DataSet1, recount the images in each class and denote the count of the i-th class by N'_i. Analyze N'_i and clean out minority classes to reduce the influence of low-quality classes on the convolutional neural network:

i. Compute the mean μ of the image counts of the current M classes;

ii. Compute the standard deviation σ of the image counts of the current M classes;

iii. Compute the cut-off value for "minority class" image counts, SepVal = μ - σ * α, where α is an integer with 1 ≤ α ≤ 10 and SepVal > 0;

iv. Count the classes among the M classes whose image count is less than SepVal; let there be m such classes;

v. Let sum be the total number of images in these m classes and SUM the total number of images in all M classes;

vi. If m/M is much larger than sum/SUM, judge the m classes to be minority classes that must be cleaned out;

If m/M and sum/SUM are close in value, regard the sizes of the m classes as normal; no cleaning is needed.
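
The minority-class test in step f) can be sketched as below. The source does not quantify "much larger", so the parameter `ratio_factor` is a hypothetical stand-in for that judgment; the function name `find_minority_classes`, the population standard deviation, and the toy counts are also assumptions.

```python
import numpy as np

def find_minority_classes(counts, alpha=1, ratio_factor=5.0):
    """Return (indices of classes below SepVal, whether they count as minority classes)."""
    counts = np.asarray(counts, dtype=float)
    mu = counts.mean()
    sigma = counts.std()                       # population standard deviation (assumption)
    sep_val = mu - alpha * sigma               # SepVal = mu - sigma * alpha
    small = np.flatnonzero(counts < sep_val)   # the m candidate classes
    if len(small) == 0:
        return small, False
    class_share = len(small) / len(counts)               # m / M
    image_share = counts[small].sum() / counts.sum()     # sum / SUM
    # "Much larger" is modelled by a hypothetical multiplicative factor.
    return small, bool(class_share > ratio_factor * image_share)

counts = [100, 110, 90, 105, 95, 2, 3, 100]   # toy per-class image counts N'_i
small, is_minority = find_minority_classes(counts)
```

In this toy case two of the eight classes (25% of the classes) hold 5 of 605 images (under 1% of the images), so they are flagged for removal.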

g) Train a convolutional neural network in the same way on the cleaned data set DataSet2, obtain the test recognition rate Acc2, record it, and compare it with Acc1 to confirm whether the cleaning was effective;

h) Depending on the state of the resulting data set, repeat steps (d) and (f) to obtain m′ classes of data after cleaning, m′ < M;

i) Evaluate the quality of the image data remaining after cleaning, which comprises m′ classes and sum′ images in total:

i. Obtain all data of these m′ classes in DataSet0; denote the total count by SUM′, where SUM′ > sum′;

ii. Train convolutional neural networks in the same way on the m′-class image data with totals SUM′ and sum′, obtaining test recognition rates Acc(SUM′) and Acc(sum′). If Acc(SUM′) < Acc(sum′), the cleaned data is more beneficial for training the classification network;

iii. Randomly or manually extract a certain amount of data from the m′-class data with total sum′ as a common test set, and use the SUM′ and sum′ data with the test portion removed as training sets. Train convolutional neural networks in the same way to obtain test recognition rates Acc(SUM′) and Acc(sum′). If Acc(SUM′) < Acc(sum′), then for the same test set the network trained on the cleaned data generalizes better and achieves a higher test recognition rate, i.e. the data quality is higher.
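
The common-test-set construction in step i) iii can be sketched as follows. The sketch handles only the bookkeeping of image identifiers, not the training itself; the function name `build_comparison_sets`, the 10% test fraction, and the integer identifiers are assumptions, and the cleaned set is assumed to be a subset of the original set.

```python
import random

def build_comparison_sets(original_ids, cleaned_ids, test_fraction=0.1, seed=0):
    """Draw a common test set from the cleaned data.

    Returns (test_ids, train_original, train_cleaned), where both training
    sets have the test images removed.
    """
    cleaned = sorted(cleaned_ids)
    n_test = int(round(test_fraction * len(cleaned)))
    test_ids = set(random.Random(seed).sample(cleaned, n_test))
    train_original = set(original_ids) - test_ids   # SUM' data minus the test part
    train_cleaned = set(cleaned_ids) - test_ids     # sum' data minus the test part
    return test_ids, train_original, train_cleaned

original_ids = range(100)         # toy SUM' = 100 images before cleaning
cleaned_ids = range(0, 100, 2)    # toy sum' = 50 images kept after cleaning
test_ids, train_original, train_cleaned = build_comparison_sets(original_ids, cleaned_ids)
```

Drawing the test set from the cleaned data means both networks are scored on images that survived cleaning, so the comparison isolates the effect of the training data.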

Brief description of the drawings

Fig. 1 is a flow chart of the overall experimental approach.

Fig. 2 shows the initial data set and the test recognition rate of its convolutional neural network.

Fig. 3 is a schematic diagram of the self-recognition array of the current data set.

Fig. 4 shows the data set after removal of low-quality images and the test recognition rate of its convolutional neural network.

Fig. 5 shows the results of the first minority-class analysis and class cleaning of the data.

Fig. 6 shows the data set after the first minority-class cleaning and the test recognition rate of its convolutional neural network.

Fig. 7 shows the results of the second minority-class analysis and class cleaning of the data.

Fig. 8 shows the data set after the second minority-class cleaning and the test recognition rate of its convolutional neural network.

Fig. 9 shows the comparison of quality tests on the classes before and after cleaning.

Fig. 10 compares the convolutional neural networks trained on the data before and after cleaning, evaluated on the same test set.

Embodiment

The invention is further described below with reference to the drawings and a specific implementation case:

1. Download plant and flower image data in bulk from the internet. After organization this yields 775 classes and 161015 images in total, where the i-th class contains N_i images (i = 1, 2, 3, ..., M);

2. Train a convolutional neural network on the resulting image data set, as follows:

a) Obtain the network model file of AlexNet for Python Caffe, together with its model file pre-trained on ImageNet, to initialize the convolutional neural network;

b) Randomly take about 90% of the image data set, 144921 images in total, as the training set and the remaining 10%, 16067 images, as the test set. Train the convolutional neural network with Caffe; after 10000 iterations the test recognition rate is 39%;

c) For the i-th class of images in the data set, construct a one-dimensional image self-recognition array K_i of length N_i (i = 1, 2, 3, ..., M), as follows:

i. Use the trained convolutional neural network to recognize the original data set image by image;

ii. Process the pseudo-probability recognition results of the j-th image of the i-th class as follows:

1) If the i-th class does not appear among the top 10 pseudo-probability results returned by the network, set K_ij = 0;

2) If the i-th class appears among the top 10 pseudo-probability results with probability p, set K_ij = p;

d) Analyze the self-recognition array and clean low-quality images from the data of class i:

iii. Compute the cut-off value for "low recognition rate" images, SepVal = μ - σ * α (this experiment uses α = 1);

iv. In the i-th class, delete images whose self-recognition rate is below SepVal as low-quality images.

e) After cleaning, 79198 images remain. About 90% of them, 71298 images, are used as the training set and the remaining 7900 as the test set. Training a convolutional neural network again in the same way gives a test recognition rate of 59.7%, a substantial improvement over the initial 39%;

f) Recount the images in each of the 775 classes after cleaning, denoting the counts by N'_i. Analyze N'_i and clean out minority classes to reduce the influence of low-quality classes on the classification network:

i. Compute the mean of the image counts of the current M = 775 classes;

ii. Compute the standard deviation of the image counts of the current M = 775 classes;

iv. After the low-quality noisy images have been cleaned, 178 classes have fewer images than SepVal, 1815 images in total. Because 178/775 is much larger than 1815/79198, these classes are judged to be low-quality minority classes and are cleaned out;
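
For reference, the two ratios compared in this step work out as follows; the rounding is ours, the counts are those stated above.

```latex
\frac{m}{M} = \frac{178}{775} \approx 0.230,
\qquad
\frac{\mathrm{sum}}{\mathrm{SUM}} = \frac{1815}{79198} \approx 0.0229
```

The share of classes is thus roughly ten times the share of images those classes contain.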

g) Of the 77383 images remaining after cleaning, 70000 are used as the training set and the remaining 7383 as the test set. Training a convolutional neural network again in the same way gives a test recognition rate of 60.0%, a slight improvement over the previous network;

h) Depending on the state of the resulting data set, clean minority classes again using the same steps as above; 468 classes remain after cleaning;

i) The performance of the 70755 images in the 468 classes remaining after cleaning is evaluated as follows:

i. Obtain all data of these 468 classes in the original data; the total is 111290 images;

ii. Train a convolutional neural network in the same way on each of the two 468-class data sets, with 111290 and 70755 images respectively. The test recognition rates are 60.8% and 62.6%, showing that the cleaned data is more beneficial for training the classification network;

iii. Randomly select 10% of the 468-class data with 70755 images as a common test set, and use the 111290-image and 70755-image sets with the test data removed as training sets. Training convolutional neural networks in the same way gives test recognition rates of 59.6% and 62.6% respectively. This shows that, on the same test set, the network trained on the cleaned data generalizes better and has a higher average accuracy, i.e. the data performs better.

The experimental results show that:

1. The data cleaning is genuinely effective: under the same evaluation method, quality improves over the raw data.

2. The next cleaning strategy can be chosen according to the state of the current data set, which makes the method flexible.

3. With an identical test set, the convolutional neural network trained on the cleaned data achieves a higher recognition rate, showing that data quality improved after cleaning.

The above embodiments only illustrate the invention and do not limit the technical solutions it describes. Therefore, all technical solutions and improvements that do not depart from the spirit and scope of the invention shall fall within the scope of its claims.

Claims

1. A computer-based method for cleaning low-quality classified image data, characterized in that the method comprises the following steps: a) download labeled image data in bulk from the internet and organize it into an image data set DataSet0 with M classes in total, where the i-th class contains N_i images, i = 1, 2, 3, ..., M;

ii. Randomly take a certain proportion of DataSet0 as the training set of the convolutional neural network;

c) For the i-th class of images in DataSet0, construct a one-dimensional image self-recognition array K_i of length N_i, as follows:

i. Use CNN0 to recognize the images of DataSet0. Denote by p_ijk the pseudo-probability that the j-th image of the i-th class is recognized as class k, k = 1, 2, 3, ..., M, and sort these pseudo-probabilities in descending order;

ii. If k = i occurs among the first L sorted pseudo-probabilities, set the self-recognition rate K_ij = p_ijk (with k = i); otherwise set K_ij = 0;

e) Train a convolutional neural network on DataSet1 in the same way, obtain the test recognition rate Acc1, record it, and compare it with Acc0 to confirm whether the cleaning was effective;

f) In DataSet1, recount the images in each class and denote the count of the i-th class by N'_i. Analyze N'_i and clean out minority classes to reduce the influence of low-quality classes on the convolutional neural network:

iii. Compute the cut-off value for "minority class" image counts, SepVal = μ - σ * α, where α is an integer with 1 ≤ α ≤ 10 and SepVal > 0;

If m/M and sum/SUM are close in value, regard the sizes of the m classes as normal; no cleaning is needed;

g) Train a convolutional neural network in the same way on the cleaned data set DataSet2, obtain the test recognition rate Acc2, record it, and compare it with Acc1 to confirm whether the cleaning was effective;

h) Depending on the state of the resulting data set, repeat steps (d) and (f) to obtain m′ classes of data after cleaning, m′ < M;

i) Evaluate the quality of the image data remaining after cleaning, which comprises m′ classes and sum′ images in total:

ii. Train convolutional neural networks in the same way on the m′-class image data with totals SUM′ and sum′, obtaining test recognition rates Acc(SUM′) and Acc(sum′). If Acc(SUM′) < Acc(sum′), the cleaned data is more beneficial for training the classification network;

iii. Randomly or manually extract a certain amount of data from the m′-class data with total sum′ as a common test set, and use the SUM′ and sum′ data with the test portion removed as training sets. Train convolutional neural networks in the same way to obtain test recognition rates Acc(SUM′) and Acc(sum′). If Acc(SUM′) < Acc(sum′), then for the same test set the network trained on the cleaned data generalizes better and achieves a higher test recognition rate, i.e. the data quality is higher.