CN107038479A

CN107038479A - A hybrid parallel genetic clustering algorithm with variable-length chromosome encoding

Info

Publication number: CN107038479A
Application number: CN201710315280.9A
Authority: CN
Inventors: 戴文华; 焦翠珍; 钱涛; 赵君喆; 闻彬; 江伟; 厉阳春; 范平
Original assignee: Hubei University of Science and Technology
Current assignee: Hubei University of Science and Technology
Priority date: 2017-05-08
Filing date: 2017-05-08
Publication date: 2017-08-11

Abstract

The invention provides a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding and belongs to the technical field of data analysis and processing. It addresses the technical problems of existing clustering algorithms: the number of clusters is difficult to determine, the choice of initial cluster centers affects the clustering result, and clustering efficiency and accuracy are low. The algorithm comprises steps such as variable-length chromosome encoding, an insert-delete crossover operator, a mutation operator, population initialization, and the design of a fitness function. The invention has the advantages of high clustering efficiency and accuracy and a wide range of applicability.

Description

A hybrid parallel genetic clustering algorithm with variable-length chromosome encoding

Technical field

The invention belongs to the technical field of data analysis and processing, and relates to a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding.

Background art

1. The K-Means algorithm

The traditional K-Means algorithm is an unsupervised learning algorithm for a known number of clusters. Its basic idea is to specify the number of classes K and cluster the samples. Starting from K randomly selected cluster centers, the samples are partitioned by the minimum-distance principle and the cluster centers are updated iteratively, so that the iteration moves toward the minimum of the objective function and reaches the best clustering effect.

The K-Means algorithm generally uses formula (1) as the objective function to decide whether the algorithm has finished.

Here DIS(X_i, Z_j) is calculated by formula (2), K is the number of clusters, X_i is a sample belonging to class C_j, and Z_j is a cluster center. The objective function is the sum of the distances from the samples of every class to their class center.
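Formulas (1) and (2) are not reproduced in this text, but the sentence above determines formula (1). A reconstruction follows; the Euclidean form given for formula (2) and the dimension count d are assumptions, since the text does not state which distance is used:

```latex
J = \sum_{j=1}^{K} \sum_{X_i \in C_j} \mathrm{DIS}(X_i, Z_j) \tag{1}

% Assumed form of formula (2): Euclidean distance over d feature dimensions
\mathrm{DIS}(X_i, Z_j) = \sqrt{\sum_{k=1}^{d} \left( x_{ik} - z_{jk} \right)^{2}} \tag{2}
```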

The traditional K-Means algorithm comprises the following steps:

1. Given a data set X of size n;

2. Choose K initial cluster centers Z_j (j = 1, 2, ..., K);

3. With the Z_j as reference points, partition the data set X by the nearest-neighbor principle, assigning each sample to a cluster;

If X_i and Z_j satisfy formula (3), X_i belongs to class j.

4. Adjust the cluster centers according to formula (4);

Here z_ij is the value of the j-th dimension of the i-th center, n_i is the number of sample points in class C_i, X_k is a sample point belonging to class C_i, and x_kj is the value of the j-th dimension of sample point X_k.

5. Calculate the value J of the objective function by formula (1);

6. If the value of J changes little over several rounds of iteration, the algorithm terminates; otherwise go to step 3.
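The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the Euclidean distance, the tolerance `tol`, the iteration cap `max_iter`, and stopping after a single round with little change in J (the text says "several rounds") are all assumptions made for brevity.

```python
import math


def dis(x, z):
    """Distance between sample x and center z (Euclidean form is an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def kmeans(samples, centers, tol=1e-6, max_iter=100):
    """Steps 3 to 6: assign samples, update centers, evaluate J, test for convergence."""
    centers = [list(c) for c in centers]
    prev_j = float("inf")
    j_value = float("inf")
    labels = []
    for _ in range(max_iter):
        # Step 3: assign every sample to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: dis(x, centers[j]))
                  for x in samples]
        # Step 4: move each center to the mean of the samples assigned to it.
        for j in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Step 5: objective J is the total sample-to-center distance.
        j_value = sum(dis(x, centers[lab]) for x, lab in zip(samples, labels))
        # Step 6: stop once J barely changes between rounds.
        if abs(prev_j - j_value) < tol:
            break
        prev_j = j_value
    return labels, centers, j_value
```

Calling `kmeans(samples, initial_centers)` returns the class label of every sample, the adjusted centers, and the final value of J.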

The K-Means clustering algorithm is an important method in data mining and knowledge discovery. It is simple, has strong local search ability, and converges quickly. These features make K-Means well suited to clustering high-dimensional vectors.

However, when K-Means is used, different choices of the number of clusters and of the initial cluster centers have a considerable influence on the clustering result.

To address the sensitivity of K-Means to the initial cluster centers, many improvements have been proposed. They concentrate on the selection of initial centers and on making reasonable use of the clustering structure.

In K-Means the initial cluster centers are selected at random. This often forces samples of the same category to serve as centers of different clusters, so the clustering deviates. To select initial cluster centers reasonably, researchers have carried out many experiments and analyses and optimized the selection with various methods. The simplest measure is to run the algorithm several times with different random initial values and keep the best result. Others proposed separating the cluster mean point from the cluster seed: when the seeds for the next round are calculated, only those data in a cluster that are most similar to the previous round's seed are used, and their mean point (geometric center) becomes the seed for the next round. Other scholars proposed KADD, an improved algorithm based on density and object direction, which determines initial cluster centers from the distribution density of the objects and then finds clusters of arbitrary shape according to the clustering direction of the objects.

In addition, by analyzing clustering models, some scholars found that clustering problems generally have a statistical property called the clustering feature. Representing clusters by clustering features retains more clustering information, which helps improve clustering quality, and parameters used in the clustering process (such as the cluster center, the sum of squared deviations, and the cluster radius) can be calculated directly from the clustering features. Multiple sampling is one method that uses clustering features for cluster analysis. Scholars have also proposed CFK-Means, a typical clustering algorithm that uses clustering features.

Although experiments confirm that these improved algorithms perform much better than traditional K-Means, they generally only optimize K-Means locally and still cannot strengthen its global search ability.

If the K-Means clustering algorithm is combined with the parallel genetic algorithm discussed below, the global optimization ability of the algorithm benefits greatly, and the clustering parameters can be optimized at the same time, so that algorithm performance improves substantially. These are the problems addressed in the remainder of this document.

2. Selection of initial cluster centers for the K-Means algorithm

The biggest problem of K-Means is the selection of initial cluster centers. If the cluster centers are chosen correctly and the initial cluster centers are optimized, the precision of the algorithm improves greatly.

At present there are mainly the following methods for selecting initial cluster centers:

1. Randomly select K samples as initial cluster centers.

This method is the simplest, and also the most likely to trap the algorithm in a local optimum.

2. Choose K representative samples from experience as initial cluster centers.

This method requires a fairly deep understanding of the characteristics and basic structure of the samples. In many practical problems, however, the characteristics and basic structure of the sample data are almost impossible to know in advance, and in such cases the method is clearly not feasible.

3. Select and cluster with several sets of initial cluster centers, and pick the best set.

This method is simple and easy to apply, but when the sample data volume is large the number of possible combinations is huge, and testing every case would consume a large amount of machine time. The method is therefore suitable only when the number of samples is small.

4. Following statistical principles, sample several times and cluster a second time to obtain the initial cluster centers.

This method clusters several sub-samples to produce multiple groups of cluster centers, clusters these cluster centers again, and compares the clustering results to obtain the best initial cluster centers.

This initial cluster center optimization algorithm operates on very small subsets of the given samples, so it needs much less memory than operating on the whole sample set and is suitable for large-scale clustering problems. However, it yields only a "suboptimal" clustering result and is affected by how the sub-sample sets are chosen.

5. Density-based selection of initial cluster centers.

This algorithm selects representative points as initial cluster centers according to sample density. First, with each sample object as center and a given positive number R as radius, a spherical neighborhood is marked out in feature space, and the number of objects falling inside it is taken as the density of that point. The object with the highest density is then chosen as the first initial cluster center; it corresponds to the highest peak of the object distribution density. Next, given a positive number D, the point with the next-highest density that lies farther than D from the first initial cluster center is selected as the second representative point, which keeps the representative points from being too concentrated. Proceeding in the same way yields K initial cluster centers.

The method requires a density radius R and a minimum distance D to be determined. The two values should differ between sample sets so that clustering accuracy can be guaranteed.
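The density-based selection can be sketched as follows. Several details are assumptions because the description leaves them open: the distance is Euclidean, a point counts itself in its own neighborhood, and every later center must lie farther than D from all centers chosen so far (the text states this only for the second center). The function name and parameters are illustrative.

```python
import math


def density_initial_centers(samples, k, radius, min_dist):
    """Return indices of up to k samples chosen as initial cluster centers."""
    def dis(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Density of a sample: number of samples inside its radius-R neighborhood.
    density = [sum(1 for q in samples if dis(p, q) <= radius) for p in samples]
    # Visit samples from the densest to the sparsest (ties keep input order).
    order = sorted(range(len(samples)), key=lambda i: -density[i])
    chosen = []
    for i in order:
        # Accept a point only if it is farther than D from every chosen center.
        if all(dis(samples[i], samples[c]) > min_dist for c in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen
```

If fewer than k points satisfy the distance condition, the function returns fewer than k indices; a caller would then have to reduce D or R, which matches the remark that both values depend on the sample set.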

6. Optimize the initial cluster centers with a genetic algorithm or an immune programming algorithm.

This kind of method uses the global random search of genetic and immune programming algorithms to optimize the initial cluster centers. Cluster centers optimized this way describe the clusters fairly accurately, and the method is widely used.

Each of the above methods has its own advantages and disadvantages. On the whole, initial cluster center optimization based on genetic or immune programming algorithms performs comparatively well. However, these algorithms are prone to premature convergence to local optima, and they do not exploit their inherent parallelism. In view of this, we propose optimizing the initial cluster centers with a parallel genetic algorithm and, on that basis, form a hybrid parallel genetic algorithm that solves the initial cluster center selection problem.

3. Parallel genetic algorithm

Parallel genetic algorithms have an individual migration strategy called "marriage"; a parallel genetic algorithm that uses it is called a parallel genetic algorithm based on the "marriage" strategy.

The parallel genetic algorithm based on the "marriage" strategy imitates human marriage customs: it prevents, as far as possible, individuals with the same genetic structure from mating, to avoid premature convergence. The algorithm evolves M (M ≥ 2) subpopulations in parallel. When the populations satisfy the marriage condition, the best individuals of the current generation from different populations marry in pairs, and the best individual among the marriage offspring is copied to the source populations concerned. To retain good genes during evolution, an elitist retention strategy can be used: the best individual among the marriage offspring is compared with the best individual of the source population, and the better one is retained as a seed for the next generation's genetic operations. The model is shown in Fig. 1.

Because marriage offspring carry genes of other populations, the marriage strategy maintains gene diversity within a population and thus effectively avoids the harm of inbreeding; at the same time, because good genes of other populations are introduced, it accelerates the search.

4. Clustering method based on the variable-length-chromosome hybrid parallel genetic algorithm

Because K-means uses a heuristic to calculate cluster centers, it effectively reduces algorithm complexity and increases speed. For the same reason, however, the algorithm is sensitive to the choice of initial cluster centers and easily falls into a local optimum.

To avoid this defect, we propose a clustering method based on a hybrid parallel genetic algorithm (Hybrid Parallel Genetic Algorithm, HPGA). The model is shown in Fig. 2. The algorithm combines the efficiency and local search ability of K-means with the global optimization ability of the parallel genetic algorithm; through inheritance and mutation within populations and parallel evolution and marriage between populations, it gives sample clustering higher efficiency and accuracy.

Summary of the invention

In view of the above problems in the prior art, the object of the invention is to provide a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding. The technical problem to be solved is how to obtain the number of clusters and the initial cluster centers adaptively during genetic evolution while improving the efficiency and accuracy of clustering.

The invention proposes a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding that improves the global search ability and the precision of the K-Means algorithm, as follows:

The parallel genetic algorithm is an effective genetic algorithm for the premature convergence problem. It makes full use of the parallelism of genetic algorithms, which greatly improves efficiency while preserving accuracy. But this efficiency gain comes solely from parallelism; it does not yet use heuristics for local optimization during the computation.

K-Means is a clustering algorithm with strong local search ability. It uses a heuristic to find cluster center points during the computation, so it is very efficient. However, its global search ability is poor and it is sensitive to the choice of initial cluster center points, so its precision cannot be guaranteed.

Combining the two brings out the advantages of both and yields an effective solution to the initial cluster center selection problem.

The hybrid parallel genetic algorithm exploits the efficiency and local search ability of K-Means together with the parallelism and global optimization ability of the parallel genetic algorithm, and so finds initial cluster centers quickly and accurately.

The object of the invention can be achieved by the following technical solution:

A. Variable-length chromosome encoding

The basic idea of the traditional K-Means algorithm is as follows: with K fixed, first find K initial cluster centers, then assign each sample point to a class by the nearest-neighbor principle, and finally adjust the center coordinates of each class. The adjustment of the center coordinates is iterated, and the algorithm stops when the objective function criterion is met.

In practical clustering problems, however, K is very hard to determine and can only be set from experience, which lowers the accuracy of the algorithm.

An in-depth analysis of K-Means and the parallel genetic algorithm shows that, in K-Means clustering based on a parallel genetic algorithm, the role of the parallel genetic algorithm is essentially to find the chromosome with the highest fitness, and the fitness calculation does not depend on the chromosome length (the number of clusters). In the hybrid model, the role of K-Means is mainly to assign samples to classes and adjust the cluster centers; this process does not conflict with varying chromosome length either, since different chromosomes simply produce different partitions.

On this basis we propose a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding in which K changes dynamically. With this algorithm the optimized number of clusters is obtained dynamically while the samples are clustered, so clustering accuracy also improves considerably.

In variable-length chromosome encoding, each gene of a chromosome is the number, in the sample set, of the sample point that serves as an initial cluster center. The encoding has the form C = {c₁, c₂, …, c_t}.

Here t is the encoding length of a chromosome and varies between chromosomes; c_i (i = 1, 2, ..., t) is the number in the sample set of the sample corresponding to the i-th cluster center, a natural number in [1, N] (N is the number of samples).

For example, if one set of initial cluster centers consists of samples 3, 7, 10 and 19, its chromosome encoding can be written C₁ = {3, 7, 10, 19}. If another set of initial cluster centers consists of samples 2, 10, 15, 18 and 20, its chromosome encoding can be written C₂ = {2, 10, 15, 18, 20}.

The two chromosomes clearly differ in length: C₁ has length 4 and thus 4 cluster centers, while C₂ has length 5 and thus 5 cluster centers.
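The encoding can be illustrated with plain Python lists. The sample coordinates below are made up for the illustration, and `decode` is a hypothetical helper name; only the chromosomes C₁ and C₂ come from the example above.

```python
def decode(chromosome, samples):
    """Map 1-based sample numbers to the initial cluster centers they denote."""
    return [samples[c - 1] for c in chromosome]


# 20 hypothetical two-dimensional samples, numbered 1..20 in the sample set.
samples = [(float(i), float(i % 3)) for i in range(1, 21)]

c1 = [3, 7, 10, 19]          # chromosome C1: 4 genes, so 4 cluster centers
c2 = [2, 10, 15, 18, 20]     # chromosome C2: 5 genes, so 5 cluster centers

centers1 = decode(c1, samples)
centers2 = decode(c2, samples)
```

The number of clusters K is simply `len(chromosome)`, which is why K can change during evolution without any change to the clustering step.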

B. Insert-delete crossover operator

We designed a dedicated insert-delete crossover operator to accommodate changes in chromosome length during parallel genetic evolution.

The idea of the insert-delete crossover operator is to delete a gene segment from one chromosome and insert that segment at some position in another chromosome.

The insert-delete crossover operator comprises the following steps:

1. Take parent individual CH₁ as the chromosome to delete from and parent individual CH₂ as the chromosome to insert into, and calculate the lengths t₁ and t₂ of CH₁ and CH₂;

2. If t₂ has already reached the empirical upper bound on the number of clusters, reselect chromosome CH₂ until its length is below that bound;

Here N is the number of samples, and the bound is an empirical value for the number of clusters that depends on N. This value is set only to speed up the algorithm; if higher accuracy is required, it can be relaxed. The requirement on t₂ prevents the genes of chromosome CH₂ from being truncated after the insertion, which would leave CH₂ unchanged.

3. Randomly generate the insertion position Ins, the deletion position Del, and the length DLen of the segment to insert or delete;

The inserted length and the deleted length are both DLen. The following conditions must hold:

0 ≤ Del < t₁, 0 ≤ Ins ≤ t₂ and DLen < t₁

4. Starting at the deletion position, delete a gene segment of length DLen from chromosome CH₁ to obtain child individual CH₁′, and insert the deleted segment into chromosome CH₂ to obtain intermediate individual CH₂*;

5. Remove the duplicate genes from CH₂* to obtain child individual CH₂′;

6. If child individual CH₂′ is too long, truncate it.

After the insert-delete operation the chromosome lengths have changed. This change preserves chromosome diversity during genetic evolution and benefits the optimization and search of the genetic algorithm.

Because chromosome length changes dynamically and insert and delete operations are frequent, we store chromosomes as dynamic linked lists. This storage makes insertion and deletion fast, and the list length can vary dynamically.
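Steps 3 to 6 of the operator can be sketched as below, using Python lists in place of the linked lists mentioned above. Step 2 (reselecting CH₂) is left to the caller. Several choices are assumptions not fixed by the text: `max_len` stands for the empirical length bound, DLen is at least 1, a segment that would run past the end of CH₁ is simply shorter, duplicate removal keeps the first occurrence, and truncation keeps the leading genes.

```python
import random


def apply_insert_delete(ch1, ch2, del_pos, ins_pos, dlen, max_len):
    """Steps 4 to 6 for given positions (0-based Del and Ins, as in the constraints)."""
    segment = ch1[del_pos:del_pos + dlen]
    child1 = ch1[:del_pos] + ch1[del_pos + dlen:]      # CH1': segment deleted
    merged = ch2[:ins_pos] + segment + ch2[ins_pos:]   # CH2*: segment inserted
    child2 = list(dict.fromkeys(merged))               # CH2': duplicates removed
    return child1, child2[:max_len]                    # truncate if too long


def insert_delete_crossover(ch1, ch2, max_len, rng=random):
    """Step 3: draw Del, Ins and DLen at random, then apply steps 4 to 6."""
    t1, t2 = len(ch1), len(ch2)
    if t1 < 2:
        raise ValueError("CH1 needs at least 2 genes so that 1 <= DLen < t1")
    del_pos = rng.randrange(t1)        # 0 <= Del < t1
    ins_pos = rng.randint(0, t2)       # 0 <= Ins <= t2
    dlen = rng.randint(1, t1 - 1)      # DLen < t1
    return apply_insert_delete(ch1, ch2, del_pos, ins_pos, dlen, max_len)
```

For instance, `apply_insert_delete([3, 7, 10, 19, 4], [2, 10, 15], 1, 1, 2, 10)` removes the segment `[7, 10]` from the first parent and yields the children `[3, 19, 4]` and `[2, 7, 10, 15]`, the duplicate gene 10 having been dropped.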

C. Mutation operator

The mutation operation on a chromosome comprises the following steps:

1. Calculate the chromosome length Len;

2. Randomly generate a natural number C in [1, Len] as the number of mutation points;

3. c = 1;

4. Randomly generate a natural number in [1, Len] that does not repeat a previous round, as the mutation point;

5. Randomly generate a number r in [0, 1]; if r ≤ P_m, go to step 6, otherwise go directly to step 7. Here P_m is the mutation probability;

6. Randomly generate a natural number in [1, N] that does not already occur in the chromosome, and replace the parent individual's gene at the mutation point with it;

7. c = c + 1;

8. If c > C, exit the mutation; otherwise go to step 4.
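The eight steps can be condensed into the sketch below. It draws all C distinct mutation points at once instead of one per round, which is equivalent to the loop over c; positions are 0-based inside the code. The function name and the `rng` parameter are illustrative.

```python
import random


def mutate(chromosome, n_samples, p_m, rng=random):
    """Return a mutated copy of the chromosome; the parent is left untouched."""
    child = list(chromosome)
    length = len(child)                            # step 1: Len
    n_points = rng.randint(1, length)              # step 2: C in [1, Len]
    points = rng.sample(range(length), n_points)   # step 4: distinct mutation points
    for pos in points:                             # steps 3, 7, 8: loop c = 1..C
        if rng.random() <= p_m:                    # step 5: mutate with probability P_m
            # Step 6: pick a sample number in [1, N] not yet in the chromosome.
            unused = [g for g in range(1, n_samples + 1) if g not in child]
            if unused:
                child[pos] = rng.choice(unused)
    return child
```

Because the replacement gene is always drawn from the numbers not present in the chromosome, the mutated chromosome never contains a duplicate cluster center.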

D. Population initialization

Because chromosome length is variable, population initialization has its own particular features. It comprises the following steps:

1. Set the population size Gsize;

2. I = 1;

3. If I ≤ Gsize, go to step 4; otherwise end the initialization;

4. Randomly set the chromosome length Len;

5. Randomly generate Len distinct natural numbers in [1, N] to form a chromosome Ind;

6. Check whether chromosome Ind already exists in the population; if so, go to step 4, otherwise go to step 7;

7. I = I + 1;

8. Go to step 3.
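A compact sketch of the initialization loop follows. The admissible range for the chromosome length is not reproduced in the text, so `min_len` and `max_len` are hypothetical parameters; comparing chromosomes as ordered gene lists in step 6 is also an assumption.

```python
import random


def init_population(gsize, n_samples, max_len, min_len=2, rng=random):
    """Build gsize distinct chromosomes of random length (steps 1 to 8)."""
    population = []
    while len(population) < gsize:                           # steps 2, 3, 7, 8
        length = rng.randint(min_len, max_len)               # step 4: random Len
        ind = rng.sample(range(1, n_samples + 1), length)    # step 5: distinct genes
        if ind not in population:                            # step 6: reject repeats
            population.append(ind)
    return population
```

The loop only terminates if enough distinct chromosomes exist for the chosen `gsize`, `n_samples` and length range, so a caller should keep `gsize` well below that number.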

E. Fitness function

Because chromosomes use variable-length encoding, the number of cluster centers is not fixed, so the fitness function differs from that of fixed-length encoding. It is defined as follows:

Here Len(Ind) is the chromosome length of individual Ind. The formula means: for each class, calculate the distance from every sample in the class to the class center and sum these distances to obtain the fitness of that class; then add 1 to the sum of the fitness values of all classes and take the reciprocal to obtain the fitness of chromosome Ind.
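The formula itself is not reproduced in this text. Read literally, the description gives f(Ind) = 1 / (1 + Σ_{j=1..Len(Ind)} Σ_{X_i ∈ C_j} DIS(X_i, Z_j)). The sketch below computes that value from a finished clustering; the Euclidean distance and the function signature are assumptions.

```python
import math


def dis(x, z):
    """Distance between sample x and center z (Euclidean form is an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def fitness(samples, labels, centers):
    """Fitness of one individual, given the clustering its chromosome produced."""
    # One accumulator per class; there are Len(Ind) classes in total.
    class_sums = [0.0] * len(centers)
    for x, lab in zip(samples, labels):
        class_sums[lab] += dis(x, centers[lab])
    # Add 1 to the total within-class distance and take the reciprocal.
    return 1.0 / (1.0 + sum(class_sums))
```

The added 1 keeps the value finite when every sample coincides with its center, and the reciprocal turns a smaller total distance into a higher fitness.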

F. Algorithm stopping criterion

The genetic algorithm stops when the number of generations exceeds the maximum number of generations GNUM, or when the population's average fitness shows no significant change over several consecutive generations.
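One way to express this criterion in code is shown below. The text does not say how many generations count as "several" or how small a change counts as insignificant, so `patience` and `eps` are hypothetical parameters.

```python
def should_stop(generation, avg_fitness_history, gnum, patience=10, eps=1e-6):
    """Stop after GNUM generations or when average fitness has stalled."""
    if generation > gnum:
        return True
    if len(avg_fitness_history) > patience:
        # Look at the spread of the average fitness over the last generations.
        recent = avg_fitness_history[-(patience + 1):]
        return max(recent) - min(recent) < eps
    return False
```

Here `avg_fitness_history` holds the population's average fitness for each generation so far, oldest first.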

Brief description of the drawings

Fig. 1 is the two-population marriage parallel genetic algorithm model of the background art.

Fig. 2 is the hybrid parallel genetic algorithm model based on two-population marriage in the invention.

Fig. 3 is a schematic diagram of the chromosome insert-delete operation in the embodiment.

Fig. 4 is a graph of the relation between the proportion of elite individuals and clustering accuracy in the embodiment.

Fig. 5 is a schematic diagram of an example of the mutation operator in the embodiment.

Detailed description of the embodiments

The following are specific embodiments of the invention; the technical solution is further described with reference to the accompanying drawings, but the invention is not limited to these embodiments.

As shown in Fig. 2, the invention proposes a hybrid parallel genetic algorithm that improves the global search ability and the precision of the K-Means algorithm, as follows:

A. Variable-length chromosome encoding

In clustering problems the number of cluster centers is hard to determine and can only be set from experience. A number of clusters fixed this way biases the clustering result, so we use the parallel genetic algorithm to determine the number of cluster centers dynamically.

Each gene of a chromosome is the number, in the sample set, of the sample point that serves as an initial cluster center. The encoding has the form C = {c₁, c₂, …, c_t}.

On this basis we propose a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding in which K changes dynamically. With this algorithm the optimized number of clusters is obtained while the samples are clustered, so clustering accuracy also improves considerably.

B. Insert-delete crossover operator

As shown in Fig. 3, we designed an insert-delete crossover operator to accommodate changes in chromosome length during parallel genetic evolution.

The steps of the operator and the constraints on Del, Ins and DLen are the same as those listed under the same heading in the preceding section.

As Fig. 3 shows, after the insert-delete operation the chromosome lengths change from 9 and 6 to 5 and 8 respectively. This change preserves chromosome diversity during genetic evolution and benefits the optimization and search of the genetic algorithm.

C. Mutation operator

The mutation operation follows steps 1 to 8 listed under the same heading in the preceding section.

For example, for chromosome CH = {3, 6, 10, 17, 5, 8} with mutation points 2 and 6 and mutation values 9 and 11, the genes at positions 2 and 6 (6 and 8) are replaced, giving {3, 9, 10, 17, 5, 11}; the operation is shown in Fig. 5.

D. Population initialization

Population initialization follows steps 1 to 8 listed under the same heading in the preceding section.

E. Fitness function

The fitness function is as defined under the same heading in the preceding section.

F. Algorithm stopping criterion

The stopping criterion is as defined under the same heading in the preceding section.

Remarks:

Empirical value for the number of clusters: when a sample set of N samples is clustered, the number of clusters generally does not exceed an upper bound that depends on N; this bound is the empirical value for the number of clusters.

Storing chromosomes as dynamic linked lists: each gene of a chromosome is one node of a dynamic linked list, and the chromosome is stored as the list. Because chromosomes differ in length, the linked lists that represent them differ in length too, which realizes the variable-length chromosome encoding of this patent.

Maximum number of generations GNUM: evolution usually stabilizes after a certain number of generations, but some evolutionary processes are slow. In that case the maximum number of generations GNUM is set so that a long-running computation does not hurt the efficiency of the algorithm.

This algorithm can be used for cluster analysis of text. The steps of text clustering are as follows:

1. Identify candidate feature words in the document set to be clustered to obtain a candidate feature word set;

2. Filter and extract from the candidate feature word set to obtain a new feature word set;

3. Represent every text in the document set as a text vector over the new feature word set;

4. Randomly select texts as cluster centers and use the text numbers as chromosome genes. Form one population from chromosomes of unequal length built this way, and build another population in the same way;

5. The two populations evolve in parallel. Each population performs selection, crossover and mutation on its own; then, for each individual in a population, text clustering is carried out with the K-Means algorithm, the fitness of each individual is calculated, and the elite individuals are retained. After each generation, the elite individuals of the two populations marry in pairs and undergo inheritance, crossover and mutation; the best individuals among the marriage offspring are likewise retained as seeds for the next generation;

6. Check whether the evolution has met the stopping criterion; if so, stop evolving and go to step 7, otherwise go to step 5;

7. Find the individual with the highest fitness in the two populations and use it as the initial cluster centers;

8. Run K-Means clustering on the document set with these initial cluster centers to obtain the final clustering result.
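Steps 5 to 7 can be outlined as a driver loop. This is a simplified sketch under stated assumptions, not the patent's implementation: `evolve`, `evaluate` and `mate` are caller-supplied callables with hypothetical names, the stopping criterion is reduced to a generation cap, threads stand in for parallel evolution, and the best marriage child replaces the worst individual of a population.

```python
from concurrent.futures import ThreadPoolExecutor


def hpga(populations, evolve, evaluate, mate, max_gen):
    """Evolve two populations in parallel and return the fittest individual.

    evolve(pop)   -> next generation (selection, crossover, mutation), a list
    evaluate(ind) -> fitness of an individual after K-Means clustering
    mate(a, b)    -> list of offspring of two elite individuals
    """
    with ThreadPoolExecutor(max_workers=len(populations)) as pool:
        for _ in range(max_gen):                       # step 6: generation cap only
            # Step 5: each population evolves independently, one thread each.
            populations = list(pool.map(evolve, populations))
            elites = [max(pop, key=evaluate) for pop in populations]
            # The elites of the two populations marry; keep the best child.
            child = max(mate(elites[0], elites[1]), key=evaluate)
            for pop, elite in zip(populations, elites):
                # Retain whichever of elite and marriage child is fitter.
                if evaluate(child) > evaluate(elite):
                    worst = min(range(len(pop)), key=lambda i: evaluate(pop[i]))
                    pop[worst] = list(child)
    # Step 7: the fittest individual overall encodes the initial cluster centers.
    return max((ind for pop in populations for ind in pop), key=evaluate)
```

The returned chromosome would then be decoded into initial cluster centers for the final K-Means run of step 8.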

Experimental setup and analysis of results:

To compare the key techniques we propose and to verify the feasibility of the proposed algorithm and techniques, we carried out a large number of experiments. The experimental platform was Windows XP; the software was developed with Visual C++ 6.0, and parallel computation was simulated with multithreading. The experiments and the analysis of their results are described in turn below.

The experimental parameters were: number of parallel populations M = 2, population size m = 100, maximum number of generations GNUM = 100, crossover probability Pc = 0.86, mutation probability Pm = 0.02, number of elite individuals Elite = 4.

1. Performance tests of the text clustering algorithm

To test the performance of the text clustering scheme based on the hybrid parallel genetic algorithm (Hybrid Parallel Genetic Algorithm, HPGA), we designed the following experiments.

Experiment 1: algorithm stability and efficiency

The K-Means clustering algorithm, the CFK-Means algorithm using clustering features, the classical genetic algorithm GA, the hybrid genetic algorithm HGA (K-Means + GA), the parallel genetic algorithm PGA, and the HPGA algorithm proposed here were each used to cluster 100 test documents (four classes, 25 per class) drawn from the State Language Commission's Modern Chinese corpus. Each algorithm was tested 50 times. The results are shown in Table 1.

Table 1: Algorithm iteration counts and stability test results

The results show that K-Means is a fast clustering algorithm but has poor stability, which reflects its dependence on the initial cluster centers. CFK-Means is considerably more stable and needs fewer iterations on average, making it a good improved algorithm. GA is more stable than CFK-Means in clustering, but it needs more generations and runs longer. PGA exploits the parallel nature of genetic algorithms and so improves noticeably in computing speed and clustering stability, but still leaves room for improvement. HGA is an efficient and stable genetic algorithm, but it does not make full use of the parallelism of genetic algorithms and so remains less efficient than HPGA. HPGA stands out in computing speed and efficiency and is more stable still, precisely because it combines the parallelism of the genetic algorithm with the efficiency of K-Means.

Experiment two clusters accuracy rate test

We have extracted 155 documents from State Language Work Committee's Modern Chinese corpus, wherein computer document 50, day Literary GEOGRAPHIC ATTRIBUTES document 32, law class document 40, energy and material class document 33.Above-mentioned five kinds of algorithms are respectively adopted and enter style of writing This cluster.By investigating whether the generic relation between any two document unanimously evaluates the effect of cluster.Specific evaluation index For Average Accuracy, its calculation formula is as follows：

Aa=(pa+na)/2 (7)

Wherein na, pa are referred to as passive accuracy rate and positive accuracy rate, and calculation formula is as follows：

Na=d/ (b+d), pa=a/ (a+c) (8)

Judged against the manual classification and the automatic clustering, the relation between any two documents falls into one of the 4 cases listed in Table 2:

Table 2: Category relation between documents

The values a, b, c, d in formula (8) are counted as follows: for each pair of documents, if the clustering result belongs to the first case, add 1 to a; if it belongs to the second case, add 1 to b; if it belongs to the third case, add 1 to c; if it belongs to the fourth case, add 1 to d.
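The pair counting behind formulas (7) and (8) can be sketched in Python. The contents of Table 2 are not reproduced in this text, so the mapping of the four cases is an assumption made only for illustration: a = same category and same cluster, b = different category but same cluster, c = same category but different cluster, d = different category and different cluster. The function name `average_accuracy` is likewise hypothetical.

```python
from itertools import combinations


def average_accuracy(manual_labels, cluster_labels):
    """Return (pa, na, aa) computed over all document pairs.

    ASSUMPTION: the case-to-counter mapping below is a guess at Table 2,
    chosen so that pa = a/(a+c) and na = d/(b+d) are meaningful ratios.
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(manual_labels)), 2):
        same_class = manual_labels[i] == manual_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_class and same_cluster:
            a += 1          # assumed case 1
        elif not same_class and same_cluster:
            b += 1          # assumed case 2
        elif same_class and not same_cluster:
            c += 1          # assumed case 3
        else:
            d += 1          # assumed case 4
    pa = a / (a + c) if a + c else 0.0   # positive accuracy, formula (8)
    na = d / (b + d) if b + d else 0.0   # negative accuracy, formula (8)
    return pa, na, (pa + na) / 2         # average accuracy, formula (7)
```

For four documents with manual categories [0, 0, 1, 1] and clusters [0, 1, 1, 1], this mapping gives a=1, b=2, c=1, d=2, so pa, na and the average accuracy are all 0.5.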

The experimental results are shown in Table 3.

Table 3: Clustering accuracy test results

	K-Means	CFK-Means	GA	PGA	HGA	HPGA
Average accuracy	72%	75%	75%	76%	90%	92%

The results of this experiment are broadly consistent with those of Experiment 1. K-Means has relatively low clustering accuracy; CFK-Means improves on K-Means; among the genetic algorithms, HGA performs relatively well; and HPGA is the most accurate text clustering algorithm. The errors of HPGA come mainly from feature extraction and from the clustering of boundary documents.

Experiment 3: Relationship between the proportion of elite individuals and clustering accuracy

Using the same 155 documents extracted for Experiment 2, we varied the number of elite individuals and clustered the texts with the HPGA clustering algorithm proposed herein. The resulting relationship between the proportion of elite individuals and the clustering accuracy is shown in Fig. 4.

Fig. 4 shows that when elite individuals make up less than 8% of the population, the clustering accuracy changes little and stays at a high level. However, when the number of elite individuals is too small, the algorithm converges too slowly, which reduces its efficiency. When elite individuals make up more than 8% of the population, the clustering accuracy drops sharply. This happens because a large proportion of elite individuals makes it hard to maintain diversity in the population: the algorithm tends to converge prematurely and the evolution settles on a local optimum, so the clustering result deviates considerably and accuracy suffers.

The analysis above shows that the number of elite individuals chosen during clustering has a considerable influence on clustering accuracy. Elite individuals accounting for 3% to 6% of the population size is generally a suitable choice.

From the experiments and analysis above, we can conclude that the variable-length chromosome coding hybrid parallel genetic algorithm proposed by the invention is an accurate and efficient algorithm for text clustering.

This algorithm can also be used in fields such as accident identification, image recognition, and the identification of defective industrial products, which are not described one by one here.

The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art to which the invention belongs may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or exceeding the scope defined in the appended claims.

Claims

1. A hybrid parallel genetic clustering algorithm with variable-length chromosome coding, comprising the following steps:

A. Variable-length chromosome coding

Each gene of a chromosome is the number, in the sample set, of the sample point corresponding to an initial cluster center. The coding form is: C={c₁,c₂,…,c_t}.

where t is the code length of a given chromosome, and the value of t varies between chromosomes; c_i (i=1,2,…,t) is the number in the sample set of the sample corresponding to the i-th cluster center, and is a natural number in [1, N] (N is the number of samples).

B. Insert-delete crossover operator

1. Take parent individual CH₁ as the chromosome to be deleted from and parent individual CH₂ as the chromosome to be inserted into, and calculate the lengths t₁ and t₂ of the two chromosomes CH₁ and CH₂;

2. If […], reselect chromosome CH₂ until […];

where N is the number of samples and […] is the empirical value of the number of clusters. This value is set only to speed up the algorithm; if higher accuracy is required, the bound can be relaxed appropriately. The requirement […] prevents the genes of chromosome CH₂ from being truncated for excess length after the insertion, which would leave the chromosome unchanged.

0 ≤ Del < t₁, 0 ≤ Ins ≤ t₂, and DLen < t₁

4. Starting from the deletion point of chromosome CH₁, delete a gene segment of length DLen to obtain child individual CH₁', and insert the deleted gene segment into chromosome CH₂ to obtain intermediate individual CH₂*;
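The deletion and insertion of step 4, under the constraints on Del, Ins and DLen stated above, can be sketched in Python. This is only an illustration under stated assumptions: reading Del as the deletion point and Ins as the insertion point, drawing all three values uniformly at random, and requiring DLen ≥ 1 are assumptions, and the function name `insert_delete_crossover` is hypothetical. The length check of step 2 and any later processing of CH₂* are not covered.

```python
import random


def insert_delete_crossover(ch1, ch2, rng=random):
    """Move a random gene segment from ch1 into ch2; inputs are not modified.

    Returns (child1, trans2), corresponding to CH1' and CH2* in step 4.
    """
    t1, t2 = len(ch1), len(ch2)
    if t1 < 2:
        raise ValueError("ch1 needs at least 2 genes so that 1 <= DLen < t1")
    del_pt = rng.randrange(t1)        # 0 <= Del < t1
    ins_pt = rng.randint(0, t2)       # 0 <= Ins <= t2
    dlen = rng.randint(1, t1 - 1)     # DLen < t1 (lower bound 1 is assumed)
    # Near the end of ch1 the slice may hold fewer than DLen genes.
    segment = ch1[del_pt:del_pt + dlen]
    child1 = ch1[:del_pt] + ch1[del_pt + dlen:]
    trans2 = ch2[:ins_pt] + segment + ch2[ins_pt:]
    return child1, trans2
```

Because DLen < t₁, the child CH₁' always keeps at least one gene, and the total number of genes across both results equals t₁ + t₂.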

C. Mutation operator processing

The mutation operation on a chromosome proceeds as follows:

1. Calculate the chromosome length Len;

3. c=1;

5. Randomly generate a number r in [0,1]; if r≤P_m, go to 6; otherwise go directly to 7; where P_m is the mutation probability.

6. Randomly generate a natural number in [1, N] that does not already exist in the chromosome, and replace the gene of the parent individual at the mutation point with this natural number;

7. c=c+1;

8. If c>C, exit the mutation; otherwise go to 4.
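The mutation loop can be sketched in Python. Steps 2 and 4 are not reproduced in this text, so two assumptions are made only for illustration: the trial count C is passed in as the hypothetical parameter `trials`, and step 4 is taken to pick a mutation point uniformly at random. The function name `mutate` is also hypothetical.

```python
import random


def mutate(chromosome, n_samples, p_m, trials, rng=random):
    """Return a mutated copy of `chromosome` (genes are numbers in [1, N])."""
    child = list(chromosome)
    length = len(child)                       # step 1: chromosome length Len
    if length == 0:
        return child
    for _ in range(trials):                   # steps 3, 7, 8: c runs from 1 to C
        point = rng.randrange(length)         # ASSUMED step 4: random mutation point
        if rng.random() <= p_m:               # step 5: mutate with probability P_m
            present = set(child)
            unused = [g for g in range(1, n_samples + 1) if g not in present]
            if unused:                        # step 6: replace with an unused number
                child[point] = rng.choice(unused)
    return child
```

Since each replacement is drawn from the numbers not present in the chromosome, a chromosome with distinct genes keeps distinct genes after mutation.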

D. Population initialization

The population of chromosomes is initialized as follows:

1. Set the population size Gsize;

2. I=1;

3. If I≤Gsize, go to 4; otherwise end the initialization;

4. Randomly set the chromosome length […];

7. I=I+1;

8. Go to 3.
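The initialization loop can be sketched in Python. The length formula of step 4 and steps 5 and 6 are not reproduced in this text, so the sketch assumes a length drawn uniformly from [2, max_len] and genes drawn as distinct sample numbers from [1, N]; `max_len` and the function name `init_population` are hypothetical.

```python
import random


def init_population(gsize, n_samples, max_len, rng=random):
    """Build `gsize` variable-length chromosomes of distinct sample numbers."""
    if not 2 <= max_len <= n_samples:
        raise ValueError("need 2 <= max_len <= n_samples")
    population = []
    for _ in range(gsize):                    # steps 2, 3, 7, 8: I runs from 1 to Gsize
        length = rng.randint(2, max_len)      # ASSUMED step 4: random length
        # ASSUMED steps 5-6: distinct sample numbers as initial cluster centers
        genes = rng.sample(range(1, n_samples + 1), length)
        population.append(genes)
    return population
```

For example, `init_population(20, 155, 12)` would return 20 chromosomes, each holding between 2 and 12 distinct document numbers.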

2. The hybrid parallel genetic clustering algorithm with variable-length chromosome coding according to claim 1, characterized in that the fitness function of said variable-length chromosome is as follows:

3. The hybrid parallel genetic clustering algorithm with variable-length chromosome coding according to claim 1, characterized in that the stopping criterion of the algorithm is: the genetic algorithm stops when the evolutionary generation exceeds the maximum number of generations GNUM, or when the average fitness of the population remains unchanged over multiple consecutive generations.
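The stopping criterion can be sketched in Python. The text does not say how many consecutive generations count as "multiple" or how equality of fitness values is judged, so `patience` and `tol` are hypothetical parameters introduced only for illustration, as is the function name `should_stop`.

```python
def should_stop(generation, avg_fitness_history, gnum, patience=5, tol=1e-9):
    """Return True when either stopping condition holds.

    avg_fitness_history: average population fitness, one value per generation.
    patience, tol: ASSUMED settings for "unchanged over consecutive generations".
    """
    if generation > gnum:                     # maximum number of generations exceeded
        return True
    if len(avg_fitness_history) <= patience:  # not enough history yet
        return False
    recent = avg_fitness_history[-(patience + 1):]
    return max(recent) - min(recent) <= tol   # unchanged across `patience` steps
```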