CN107038479A - A hybrid parallel genetic clustering algorithm with variable-length chromosome coding

Publication number
CN107038479A
CN107038479A CN201710315280.9A CN201710315280A CN107038479A CN 107038479 A CN107038479 A CN 107038479A CN 201710315280 A CN201710315280 A CN 201710315280A CN 107038479 A CN107038479 A CN 107038479A
Authority
CN
China
Prior art keywords
chromosome
algorithm
length
cluster
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710315280.9A
Other languages
Chinese (zh)
Inventor
戴文华
焦翠珍
钱涛
赵君喆
闻彬
江伟
厉阳春
范平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Science and Technology
Original Assignee
Hubei University of Science and Technology

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Abstract

The invention provides a hybrid parallel genetic clustering algorithm with variable-length chromosome coding, belonging to the technical field of data analysis and processing. It addresses the technical problems that, in existing clustering algorithms, the number of clusters is difficult to determine, the choice of initial cluster centers influences the clustering result, and clustering efficiency and accuracy are low. The algorithm comprises variable-length chromosome coding, an insert-delete crossover operator, a mutation operator, population initialization, and the design of a fitness function. The invention offers high clustering efficiency and accuracy and a wide range of applicability.

Description

A hybrid parallel genetic clustering algorithm with variable-length chromosome coding
Technical field
The invention belongs to the technical field of data analysis and processing, and relates to a hybrid parallel genetic clustering algorithm with variable-length chromosome coding.
Background art
1. The K-Means algorithm
The traditional K-Means algorithm is an unsupervised learning algorithm for which the number of clusters is known in advance. Its basic idea is as follows: the number of classes K is specified, and the samples are clustered. Starting from K randomly selected cluster centers, the samples are partitioned by the minimum-distance principle and the cluster centers are updated iteratively, so that the iteration moves toward smaller values of the objective function and reaches the best clustering effect.
In the K-Means algorithm, formula (1) is generally used as the objective function to decide explicitly whether the algorithm should terminate.
J = \sum_{j=1}^{K} \sum_{X_i \in C_j} DIS(X_i, Z_j)    (1)
Here DIS(X_i, Z_j) is computed by formula (2), K is the number of clusters, X_i is a sample belonging to class C_j, and Z_j is the cluster center of that class. The objective function is, in effect, the sum of the distances from the samples of each class to the center of that class.
The traditional K-Means algorithm proceeds as follows:
1. A data set X of size n is given;
2. K initial cluster centers Z_j (j = 1, 2, ..., K) are chosen;
3. Taking the Z_j as reference points, the data set X is partitioned by the nearest-neighbour principle, so that each sample is assigned to a cluster;
If X_i and Z_j satisfy formula (3), then X_i belongs to class j.
DIS(X_i, Z_j) = \min_{k = 1, ..., K} DIS(X_i, Z_k)    (3)
4. The cluster centers are adjusted according to formula (4);
z_{ij} = \frac{1}{n_i} \sum_{X_k \in C_i} x_{kj}    (4)
Here z_ij is the value of the j-th dimension of the i-th center, n_i is the number of sample points in class C_i, X_k is a sample point belonging to class C_i, and x_kj is the value of the j-th dimension of sample point X_k.
5. The value J of the objective function is computed by formula (1);
6. If J changes little over several rounds of iteration, the algorithm terminates; otherwise go to step 3.
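The steps above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the Euclidean distance used for DIS is an assumption (formula (2) is not reproduced in this text), and the names k_means, dis, tol and max_iter are hypothetical.

```python
import math

def dis(x, z):
    # Distance between a sample and a center. Euclidean distance is an
    # assumption; formula (2) is not reproduced in the text.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))

def k_means(samples, centers, tol=1e-9, max_iter=100):
    """Steps 3 to 6: assign samples, update centers, evaluate J, test for stall."""
    centers = [list(c) for c in centers]
    prev_j = float("inf")
    for _ in range(max_iter):
        # Step 3: assign every sample to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: dis(x, centers[j]))
                  for x in samples]
        # Step 4: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Step 5: J is the total distance of the samples to their centers.
        j_val = sum(dis(x, centers[lab]) for x, lab in zip(samples, labels))
        # Step 6: stop once J barely changes between rounds.
        if abs(prev_j - j_val) <= tol:
            break
        prev_j = j_val
    return centers, labels, j_val
```

The initial centers are passed in explicitly, because their selection is exactly the part that the rest of the document sets out to optimize.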
K-Means clustering is an important method in data mining and knowledge discovery. It is simple, has strong local search ability, and converges quickly. These properties make the K-Means algorithm well suited to clustering high-dimensional vectors.
However, when the K-Means algorithm is used, different choices of the number of clusters and of the initial cluster centers can strongly affect the clustering result.
To address the sensitivity of K-Means to the initial cluster centers, various improvements have been proposed. They concentrate mainly on the selection of the initial centers and on making reasonable use of the cluster structure.
In the K-Means algorithm the initial cluster centers are selected at random. This often forces samples of the same class to serve as centers of different classes, so the clustering deviates from the true structure. To select initial centers sensibly, researchers have carried out many experiments and analyses and have optimized the choice of centers by various methods. The simplest measure is to run the algorithm several times from different random initial values and keep the best result. Some authors propose separating the cluster mean point from the cluster seed: when the seed for the next round is computed, the data in the cluster that are most similar to the previous seed are used, and their mean point (geometric center) becomes the seed of the next round. Others propose KADD, an improved algorithm based on density and object direction, which determines the initial centers from the distribution density of the objects and then finds clusters of arbitrary shape according to the clustering direction of the objects.
In addition, by analysing clustering models, some researchers found that clustering problems generally possess a statistical property known as the clustering feature. Representing clusters by clustering features retains more clustering information and helps to improve clustering quality, and parameters of the clustering process (such as the cluster center, the sum of squared deviations, and the cluster radius) can be computed directly from the clustering features. Multiple sampling is one method that uses clustering features for clustering. CFK-Means is a typical clustering algorithm that uses clustering features.
Although experiments confirm that these improved K-Means algorithms perform much better than the traditional algorithm, they generally improve only the local optimization of K-Means and still cannot strengthen its global search ability.
If K-Means clustering can be combined with the parallel genetic algorithm discussed below, the global optimization ability of the algorithm will benefit greatly, and the clustering parameters can be optimized at the same time, so that performance improves substantially. These are the problems addressed in what follows.
2. Selection of initial cluster centers for the K-Means algorithm
The greatest problem of the K-Means algorithm is the selection of the initial cluster centers. If the centers can be chosen correctly and then optimized, the precision of the algorithm improves considerably.
At present the initial cluster centers are selected mainly by the following methods:
1. Select K samples at random as the initial cluster centers.
This method is the simplest, and also the most likely to trap the algorithm in a local optimum.
2. Choose K representative samples as the initial cluster centers on the basis of experience.
This requires a fairly deep understanding of the characteristics and basic structure of the samples. In many practical problems, however, these are almost unknown, and in such cases the method is clearly not applicable.
3. Perform initial-center selection and clustering many times, and pick the best set of initial cluster centers.
This method is simple to apply, but when the sample volume is large the number of possible combinations is huge, and testing all of them would consume a great deal of machine time. It is therefore suitable only when the number of samples is small.
4. Following statistical principles, draw several subsamples and perform a second clustering to obtain the initial cluster centers.
This method clusters each of the drawn subsamples to produce several sets of cluster centers, clusters these centers again, and compares the results to obtain the best initial cluster centers.
Because this optimization operates on a very small subset of the given samples, it needs far less memory than operating on the whole sample set, and it is suitable for large-scale clustering problems. However, it yields only a "suboptimal" clustering result and is affected by how the subsample sets are chosen.
5. Density-based selection of initial cluster centers.
This algorithm selects representative points as initial cluster centers according to sample density. First, taking each sample object as center and a given positive number R as radius, a spherical neighbourhood is marked out in feature space, and the number of objects falling into the neighbourhood is taken as the density of that point. The object with the greatest density is then chosen as the first initial cluster center; it corresponds to the highest peak of the object distribution density. Finally, given a positive number D, the point of next-greatest density lying farther than D from the first initial cluster center is selected as the second representative point, which keeps the representative points from being too concentrated. Continuing in this way, K initial cluster centers can be selected.
This method requires the density radius R and the minimum distance D to be determined. The two values should differ for different sample sets; only then can the accuracy of the clustering be ensured.
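The density-based selection can be sketched as follows. This is an illustration under assumptions: the metric is taken to be Euclidean, each new representative point is required to lie farther than D from all points already chosen (one reading of "continuing in this way"), and the names density_centers, radius and min_dist are hypothetical.

```python
import math

def density_centers(samples, k, radius, min_dist):
    """Return the indices of up to k dense samples that lie more than min_dist apart."""
    # Density of a point: number of samples inside its radius-R neighbourhood
    # (the point itself is counted). Euclidean distance is an assumption.
    density = [sum(1 for y in samples if math.dist(x, y) <= radius)
               for x in samples]
    # Visit candidates from densest to sparsest (ties keep sample order).
    order = sorted(range(len(samples)), key=lambda i: -density[i])
    chosen = []
    for i in order:
        # Accept a candidate only if it is farther than D from every chosen point.
        if all(math.dist(samples[i], samples[c]) > min_dist for c in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen
```

The function returns fewer than k indices when R and D leave too few admissible points, which reflects the remark that both values have to be tuned per sample set.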
6. Optimize the initial cluster centers with a genetic algorithm or an immune programming algorithm.
This kind of method uses the global random search of genetic algorithms and immune programming algorithms to optimize the initial cluster centers. The centers optimized in this way describe the cluster characteristics fairly accurately, and the method is widely used.
Each of the methods above has its own advantages and disadvantages. On the whole, initial-center optimization based on genetic algorithms and immune programming algorithms is comparatively good. However, these algorithms are prone to premature convergence, and they do not exploit their inherent parallelism. In view of this, we propose to optimize the initial cluster centers with a parallel genetic algorithm, and on this basis form a hybrid parallel genetic algorithm that solves the problem of selecting initial cluster centers.
3. Parallel genetic algorithm
In parallel genetic algorithms there is an individual migration strategy called "marriage"; a parallel genetic algorithm that uses it is called a parallel genetic algorithm based on the "marriage" strategy.
The parallel genetic algorithm based on the "marriage" strategy imitates human intermarriage: it avoids, as far as possible, mating between individuals with the same gene structure, so as to prevent premature convergence. The algorithm evolves M (M ≥ 2) subpopulations in parallel. When the marriage condition is met between populations, the best individuals of the current generation from different populations are married pairwise, and the best individual among the marriage offspring is copied to the relevant source populations. To preserve the good genes of chromosomes during evolution, an elitist retention strategy can be used: the best individual among the marriage offspring is compared with the best individual of the source population, and the better one is retained as the seed for the next generation of genetic operations. The model is shown in Fig. 1.
Because the marriage offspring carry genes from other populations, the marriage strategy on the one hand maintains gene diversity within a population and so effectively prevents the harm of inbreeding, and on the other hand introduces good genes from other populations and so speeds up the search.
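A single marriage event between two subpopulations might look like the sketch below. It only illustrates the elitist comparison described above; the function marry and its fitness and crossover arguments are hypothetical names, and the marriage condition and the scheduling of the M subpopulations are left out.

```python
def marry(pop_a, pop_b, fitness, crossover):
    """One marriage between two subpopulations; returns the seed kept by each."""
    # Best individual of the current generation in each source population.
    best_a = max(pop_a, key=fitness)
    best_b = max(pop_b, key=fitness)
    # The two champions are crossed to produce the marriage offspring.
    offspring = crossover(best_a, best_b)
    best_child = max(offspring, key=fitness)
    # Elitist retention: each population keeps the better of its own
    # champion and the best marriage offspring as next-generation seed.
    seed_a = max(best_child, best_a, key=fitness)
    seed_b = max(best_child, best_b, key=fitness)
    return seed_a, seed_b
```

Passing fitness and crossover as callables keeps the sketch independent of the chromosome representation.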
4. Clustering method based on the variable-length chromosome hybrid parallel genetic algorithm
Because the K-Means algorithm uses a heuristic in computing the cluster centers, it reduces algorithmic complexity and raises computing speed. For the same reason, however, it is sensitive to the choice of initial cluster centers and easily falls into a local optimum.
To avoid this defect, we propose a clustering method based on a hybrid parallel genetic algorithm (Hybrid Parallel Genetic Algorithm, HPGA). The model is shown in Fig. 2. The algorithm combines the efficiency and local search ability of the K-Means algorithm with the global optimization ability of the parallel genetic algorithm; through inheritance and mutation within populations, and parallel evolution and marriage between populations, it clusters samples with higher efficiency and accuracy.
Summary of the invention
The object of the invention is to address the above problems of the prior art by providing a hybrid parallel genetic clustering algorithm with variable-length chromosome coding. The technical problem to be solved is how to obtain the number of clusters and the initial cluster centers adaptively during genetic evolution, and how to improve the efficiency and accuracy of clustering.
The invention proposes a hybrid parallel genetic clustering algorithm with variable-length chromosome coding, which improves the global search ability and the precision of the K-Means algorithm, as follows:
The parallel genetic algorithm is an effective genetic algorithm for the premature-convergence problem. It makes full use of the parallelism of genetic algorithms, which greatly raises efficiency while preserving accuracy. This gain, however, comes solely from parallelism; the use of heuristics for local optimization during the computation is not yet considered.
The K-Means algorithm is a clustering algorithm with strong local search ability. It uses a heuristic to find the cluster centers during the computation, so it is very efficient. Its global search ability is poor, however, and it is sensitive to the choice of initial cluster centers, so its precision cannot be guaranteed.
Combining the two brings out the advantages of both and yields an effective solution to the problem of selecting initial cluster centers.
The hybrid parallel genetic algorithm exploits the efficiency and local search ability of the K-Means algorithm together with the parallelism and global optimization ability of the parallel genetic algorithm, and so finds the initial cluster centers quickly and accurately.
The object of the invention can be achieved by the following technical scheme of a hybrid parallel genetic clustering algorithm with variable-length chromosome coding:
A. Variable-length chromosome coding
The basic idea of the traditional K-Means algorithm is: with K fixed, first find K initial cluster centers, then assign each sample point to a class by the nearest-neighbour principle, and finally adjust the center coordinates of each class. The adjustment of the center coordinates is iterated, and the algorithm stops when the objective function criterion is met.
In practical clustering problems, however, K is very hard to determine and can only be set from experience. This reduces the accuracy of the algorithm.
From an in-depth analysis of the K-Means algorithm and the parallel genetic algorithm we find that, in K-Means clustering based on a parallel genetic algorithm, the role of the parallel genetic algorithm is essentially to find the chromosome with the highest fitness, and the computation of fitness is independent of chromosome length (the number of clusters). In the hybrid model, the role of the K-Means algorithm is mainly to partition the samples into classes and adjust the cluster centers. This process does not conflict with chromosome length either; different chromosomes simply yield different partitions.
On this basis we propose a hybrid parallel genetic clustering algorithm with variable-length chromosome coding in which K changes dynamically. With this algorithm the optimized number of clusters is obtained dynamically while the samples are being clustered, so the accuracy of the clustering improves as well.
In variable-length chromosome coding, each gene of a chromosome is the number, within the sample set, of the sample point corresponding to an initial cluster center. The coding form is: C = {c_1, c_2, ..., c_t}.
Here t is the code length of a chromosome and varies from chromosome to chromosome, and c_i (i = 1, 2, ..., t) is the number in the sample set of the sample corresponding to the i-th cluster center, a natural number in [1, N] (N is the number of samples).
For example, if one set of initial cluster centers consists of sample 3, sample 7, sample 10 and sample 19, its chromosome coding can be written as C_1 = {3, 7, 10, 19}. If another set of initial cluster centers consists of sample 2, sample 10, sample 15, sample 18 and sample 20, its chromosome coding can be written as C_2 = {2, 10, 15, 18, 20}.
The two chromosomes clearly differ in length: C_1 has length 4 and encodes 4 cluster centers, while C_2 has length 5 and encodes 5 cluster centers.
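The encoding can be made concrete with a short sketch. The toy sample set and the helper decode are hypothetical; the only thing taken from the text is that genes are 1-based sample numbers in [1, N].

```python
def decode(chromosome, samples):
    """Map each gene (a 1-based sample number) to the sample it names."""
    return [samples[g - 1] for g in chromosome]

# Toy sample set with N = 20 two-dimensional points (illustrative only).
samples = [(float(i), float(i)) for i in range(1, 21)]

c1 = [3, 7, 10, 19]        # encodes 4 initial cluster centers
c2 = [2, 10, 15, 18, 20]   # encodes 5 initial cluster centers

centers_1 = decode(c1, samples)
centers_2 = decode(c2, samples)
```

The number of clusters is simply the length of the chromosome, so no separate K parameter has to be stored.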
B. Insert-delete crossover operator
We designed an insert-delete crossover operator specifically to accommodate changes of chromosome length during parallel genetic evolution.
The idea of the operator is to delete a segment of genes from one chromosome and insert that segment at some position in another chromosome.
The steps of the chromosome insert-delete crossover operator are as follows:
1. Take parent individual CH_1 as the chromosome to be deleted from and parent individual CH_2 as the chromosome to be inserted into, and compute the lengths t_1 and t_2 of CH_1 and CH_2;
2. If the length condition is not met, reselect chromosome CH_2 until it is;
Here N is the number of samples, and the bound used in the length condition is an empirical value for the number of clusters. It is set only to speed up the algorithm; if higher precision is required, the bound can be relaxed appropriately. The length condition prevents chromosome CH_2 from being left unchanged after the insertion because the inserted genes are truncated for excess length.
3. Randomly generate the insertion point Ins, the deletion point Del, and the length DLen of the segment to be inserted or deleted;
The inserted length and the deleted length are equal, both DLen. The following conditions must hold:
0 ≤ Del < t_1, 0 ≤ Ins ≤ t_2, and DLen < t_1
4. Starting at the deletion point, delete a gene segment of length DLen from chromosome CH_1 to obtain child individual CH_1', and insert the deleted segment into chromosome CH_2 to obtain the intermediate individual CH_2*;
5. Remove the duplicate genes from CH_2* to obtain child individual CH_2';
6. If child individual CH_2' is too long, truncate it.
After the insert-delete operation the lengths of the chromosomes have changed. This change preserves the diversity of chromosomes during genetic evolution and favours the optimization and search of the genetic algorithm.
Because chromosome length changes dynamically and insert and delete operations are frequent, we store chromosomes as dynamic linked lists. This storage makes insertion and deletion fast, and the list length can vary dynamically.
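Steps 3 to 6 of the operator can be sketched as below. Several points are assumptions: Python lists stand in for the linked lists, duplicates are removed by keeping the first occurrence of each gene, max_len is a hypothetical parameter standing in for the length bound (whose formula is not reproduced in this text), and step 2 (reselecting CH_2) is left to the caller.

```python
import random

def apply_insert_delete(ch1, ch2, delete_at, insert_at, d_len, max_len):
    """Steps 4 to 6 with explicit 0-based positions."""
    segment = ch1[delete_at:delete_at + d_len]            # genes removed from CH1
    child1 = ch1[:delete_at] + ch1[delete_at + d_len:]    # CH1'
    merged = ch2[:insert_at] + segment + ch2[insert_at:]  # CH2*
    child2 = list(dict.fromkeys(merged))   # step 5: drop repeats, keep first occurrence
    return child1, child2[:max_len]        # step 6: truncate an overlong child

def insert_delete_crossover(ch1, ch2, max_len, rng=random):
    """Step 3: draw Del, Ins and DLen at random, then apply steps 4 to 6."""
    t1, t2 = len(ch1), len(ch2)
    if t1 < 2:
        raise ValueError("CH1 needs at least two genes so that DLen < t1")
    delete_at = rng.randrange(t1)      # 0 <= Del < t1
    insert_at = rng.randint(0, t2)     # 0 <= Ins <= t2
    d_len = rng.randint(1, t1 - 1)     # 1 <= DLen < t1
    return apply_insert_delete(ch1, ch2, delete_at, insert_at, d_len, max_len)

# Hypothetical parents of lengths 9 and 6.
parent1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
parent2 = [2, 3, 10, 11, 12, 13]
print(apply_insert_delete(parent1, parent2, delete_at=1, insert_at=6,
                          d_len=4, max_len=10))
# ([1, 6, 7, 8, 9], [2, 3, 10, 11, 12, 13, 4, 5])
```

When the deletion point is close to the end of CH_1, fewer than DLen genes are available and the slice simply returns the shorter segment; a stricter implementation could redraw the positions instead.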
C. Mutation operator
The mutation operation on a chromosome proceeds as follows:
1. Compute the chromosome length Len;
2. Randomly generate a natural number C in [1, Len] as the number of mutation points;
3. c = 1;
4. Randomly generate a natural number in [1, Len] that does not repeat those of earlier rounds, as the mutation point;
5. Randomly generate a number r in [0, 1]; if r ≤ P_m, go to step 6, otherwise go directly to step 7. Here P_m is the mutation probability.
6. Randomly generate a natural number in [1, N] that does not already occur in the chromosome, and replace the parent individual's gene at the mutation point with this number;
7. c = c + 1;
8. If c > C, exit the mutation; otherwise go to step 4.
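The mutation steps can be sketched as follows. The loop over c is replaced by drawing all C distinct mutation points at once, which is an equivalent reformulation chosen for brevity; positions are 0-based inside the code, and the names mutate, n_samples and p_m are hypothetical.

```python
import random

def mutate(chromosome, n_samples, p_m, rng=random):
    """Return a mutated copy; genes are sample numbers in [1, n_samples]."""
    child = list(chromosome)
    length = len(child)                        # step 1: Len
    count = rng.randint(1, length)             # step 2: number of mutation points C
    points = rng.sample(range(length), count)  # step 4: distinct positions
    for pos in points:                         # steps 3, 7, 8: one round per point
        if rng.random() <= p_m:                # step 5: mutate with probability P_m
            # Step 6: pick a sample number not yet present in the chromosome.
            unused = [g for g in range(1, n_samples + 1) if g not in child]
            if unused:
                child[pos] = rng.choice(unused)
    return child
```

Because every replacement is drawn from the numbers not yet in the chromosome, the child never contains duplicate cluster centers.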
D. Population initialization
Because chromosome length is variable, population initialization has some particular features. The steps are as follows:
1. Set the population size Gsize;
2. I = 1;
3. If I ≤ Gsize, go to step 4; otherwise end the initialization;
4. Randomly set the chromosome length Len;
5. Randomly generate Len non-repeating natural numbers in [1, N] to form a chromosome Ind;
6. Check whether chromosome Ind already exists in the population; if it does, go to step 4, otherwise go to step 7;
7. I = I + 1;
8. Go to step 3.
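A minimal sketch of the initialization follows. The range from which the length Len is drawn is not reproduced in this text, so min_len and max_len are hypothetical parameters; the function name init_population is likewise an assumption.

```python
import random

def init_population(g_size, n_samples, max_len, min_len=2, rng=random):
    """Build g_size distinct chromosomes of random length."""
    population = []
    while len(population) < g_size:                  # steps 2, 3, 7 and 8
        length = rng.randint(min_len, max_len)       # step 4 (range is assumed)
        # Step 5: Len distinct sample numbers in [1, N].
        ind = rng.sample(range(1, n_samples + 1), length)
        if ind not in population:                    # step 6: reject duplicates
            population.append(ind)
    return population
```

The loop assumes that g_size is far smaller than the number of possible chromosomes; otherwise the duplicate check in step 6 could keep it from terminating.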
E. Fitness function
Because chromosomes use variable-length coding, the number of cluster centers is not fixed, and the fitness function therefore differs from that of fixed-length coding. It is defined as follows:
Fitness(Ind) = \frac{1}{1 + \sum_{j=1}^{Len(Ind)} \sum_{X_i \in C_j} DIS(X_i, Z_j)}
Here Len(Ind) is the chromosome length of individual Ind. The formula means: compute the distance from each sample of a class to the center of that class and sum these distances to obtain the value for the class; then add 1 to the sum over all classes and take the reciprocal to obtain the fitness of chromosome Ind.
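The verbal definition above can be sketched as follows. Two simplifications are assumed: the distance is Euclidean, and the encoded samples are used directly as class centers, without the K-Means refinement that the hybrid model applies. The function name fitness is hypothetical.

```python
import math

def fitness(chromosome, samples):
    """Reciprocal of (1 + total distance of samples to their class centers)."""
    # Decode the genes (1-based sample numbers) into center coordinates.
    centers = [samples[g - 1] for g in chromosome]
    # Each sample belongs to its nearest center, so its contribution is
    # the distance to that center; summing over samples sums over classes.
    total = sum(min(math.dist(x, z) for z in centers) for x in samples)
    return 1.0 / (1.0 + total)
```

Adding 1 to the denominator keeps the value finite when every sample coincides with a center, in which case the fitness reaches its maximum of 1.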
F. Stopping criterion
The genetic algorithm stops when the number of generations exceeds the maximum number of generations GNUM, or when the average fitness of the population shows no significant change over many consecutive generations.
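The stopping test might be written as below. The text does not say how many generations count as "many" or how small a change is insignificant, so patience and eps are hypothetical parameters, as is the function name should_stop.

```python
def should_stop(avg_fitness_history, g_num, patience=10, eps=1e-6):
    """avg_fitness_history holds one average fitness value per finished generation."""
    generation = len(avg_fitness_history)
    # Criterion 1: the generation count exceeds the maximum GNUM.
    if generation > g_num:
        return True
    # Criterion 2: the average fitness stayed within eps over the
    # last `patience` generation-to-generation transitions.
    if generation > patience:
        recent = avg_fitness_history[-(patience + 1):]
        return max(recent) - min(recent) <= eps
    return False
```

Calling the function once per generation with the running history is enough to drive the main evolution loop.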
Brief description of the drawings
Fig. 1 is the two-population marriage parallel genetic algorithm model of the background art.
Fig. 2 is the hybrid parallel genetic algorithm model based on two-population marriage in the invention.
Fig. 3 is a schematic diagram of the chromosome insert-delete operation in the embodiment.
Fig. 4 shows the relation between the proportion of elite individuals and the clustering accuracy in the embodiment.
Fig. 5 is a schematic example of the operation of the mutation operator in the embodiment.
Detailed description of the embodiments
The following specific embodiment, together with the drawings, further describes the technical scheme of the invention; the invention is not limited to this embodiment.
As shown in Fig. 2, the invention proposes a hybrid parallel genetic algorithm that improves the global search ability and the precision of the K-Means algorithm, as follows:
The parallel genetic algorithm is an effective genetic algorithm for the premature-convergence problem. It makes full use of the parallelism of genetic algorithms, which greatly raises efficiency while preserving accuracy. This gain, however, comes solely from parallelism; the use of heuristics for local optimization during the computation is not yet considered.
The K-Means algorithm is a clustering algorithm with strong local search ability. It uses a heuristic to find the cluster centers during the computation, so it is very efficient. Its global search ability is poor, however, and it is sensitive to the choice of initial cluster centers, so its precision cannot be guaranteed.
Combining the two brings out the advantages of both and yields an effective solution to the problem of selecting initial cluster centers.
The hybrid parallel genetic algorithm exploits the efficiency and local search ability of the K-Means algorithm together with the parallelism and global optimization ability of the parallel genetic algorithm, and so finds the initial cluster centers quickly and accurately.
A. Variable-length chromosome coding
In clustering problems the number of cluster centers is hard to determine and can only be set from experience; a number set in this way biases the clustering result. We therefore use the parallel genetic algorithm to determine the number of cluster centers dynamically.
Each gene of a chromosome is the number, within the sample set, of the sample point corresponding to an initial cluster center. The coding form is: C = {c_1, c_2, ..., c_t}.
Here t is the code length of a chromosome and varies from chromosome to chromosome, and c_i (i = 1, 2, ..., t) is the number in the sample set of the sample corresponding to the i-th cluster center, a natural number in [1, N] (N is the number of samples).
For example, if one set of initial cluster centers consists of sample 3, sample 7, sample 10 and sample 19, its chromosome coding can be written as C_1 = {3, 7, 10, 19}. If another set of initial cluster centers consists of sample 2, sample 10, sample 15, sample 18 and sample 20, its chromosome coding can be written as C_2 = {2, 10, 15, 18, 20}.
The two chromosomes clearly differ in length: C_1 has length 4 and encodes 4 cluster centers, while C_2 has length 5 and encodes 5 cluster centers.
The basic idea of the traditional K-Means algorithm is: with K fixed, first find K initial cluster centers, then assign each sample point to a class by the nearest-neighbour principle, and finally adjust the center coordinates of each class. The adjustment of the center coordinates is iterated, and the algorithm stops when the objective function criterion is met.
In practical clustering problems, however, K is very hard to determine and can only be set from experience. This reduces the accuracy of the algorithm.
From an in-depth analysis of the K-Means algorithm and the parallel genetic algorithm we find that, in K-Means clustering based on a parallel genetic algorithm, the role of the parallel genetic algorithm is essentially to find the chromosome with the highest fitness, and the computation of fitness is independent of chromosome length (the number of clusters). In the hybrid model, the role of the K-Means algorithm is mainly to partition the samples into classes and adjust the cluster centers. This process does not conflict with chromosome length either; different chromosomes simply yield different partitions.
On this basis we propose a hybrid parallel genetic clustering algorithm with variable-length chromosome coding in which K changes dynamically. With this algorithm the optimized number of clusters is obtained while the samples are being clustered, so the accuracy of the clustering improves as well.
B, insert and delete crossover operator
As shown in Figure 3, we designed an insert-delete crossover operator to accommodate changes in chromosome length during parallel genetic evolution.
The idea of the insert-delete crossover operator is to delete a gene segment from one chromosome and insert that segment at some position in another chromosome.
The insert-delete crossover operator proceeds as follows:
1. Take parent individual CH1 as the chromosome to delete from and parent individual CH2 as the chromosome to insert into, and compute the lengths t1 and t2 of CH1 and CH2.
2. If [formula not reproduced], reselect chromosome CH2 until [formula not reproduced].
Here N is the number of samples and [formula not reproduced] is the empirical value of the number of clusters. This value is set only to speed up the algorithm; if higher accuracy is required, the bound can be relaxed appropriately. The requirement [formula not reproduced] prevents chromosome CH2 from being truncated for excess length after the insertion and thus ending up unchanged.
3. Randomly generate the insertion position Ins, the deletion position Del, and the length DLen of the segment to insert or delete.
The inserted length and the deleted length are equal, both DLen. The following conditions must hold:
0 ≤ Del < t1, 0 ≤ Ins ≤ t2, and DLen < t1
4. Starting at the deletion point, delete a gene segment of length DLen from chromosome CH1 to obtain child individual CH1', and insert the deleted segment into chromosome CH2 to obtain the intermediate chromosome CH2*.
5. Remove the duplicate genes from CH2* to obtain child individual CH2'.
6. If child individual CH2' is too long, truncate it.
As Figure 3 shows, after the insert-delete operation the chromosome lengths change from 9 and 6 to 5 and 8, respectively. This change helps maintain chromosome diversity during genetic evolution, which benefits the optimization and search of the genetic algorithm.
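The crossover steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the names `insert_delete_crossover`, `apply_insert_delete` and `max_len` are hypothetical, `max_len` stands in for the length cap whose formula is not reproduced in the text, the extra bound `DLen <= t1 - Del` (so that the segment fits inside CH1) is an assumption, and the reselection of CH2 in step 2 is left out because its condition is not available. The gene values in the example are invented; only the lengths 9 and 6 becoming 5 and 8 come from the description of Figure 3.

```python
import random


def apply_insert_delete(ch1, ch2, del_pos, ins_pos, d_len, max_len):
    """Steps 4-6 for fixed positions; chromosomes are lists of sample numbers."""
    segment = ch1[del_pos:del_pos + d_len]
    child1 = ch1[:del_pos] + ch1[del_pos + d_len:]            # CH1'
    transformed = ch2[:ins_pos] + segment + ch2[ins_pos:]     # CH2*
    child2 = list(dict.fromkeys(transformed))                 # drop duplicates, keep first occurrence
    return child1, child2[:max_len]                           # truncate CH2' if too long


def insert_delete_crossover(ch1, ch2, max_len, rng=random):
    """Step 3 (random Del, Ins, DLen) followed by steps 4-6."""
    t1, t2 = len(ch1), len(ch2)
    if t1 < 2:
        raise ValueError("ch1 needs at least 2 genes so that 1 <= DLen < t1")
    del_pos = rng.randrange(0, t1)        # 0 <= Del < t1
    ins_pos = rng.randrange(0, t2 + 1)    # 0 <= Ins <= t2
    # Assumption: DLen is also capped by t1 - Del so the segment lies inside ch1.
    d_len = rng.randint(1, min(t1 - 1, t1 - del_pos))
    return apply_insert_delete(ch1, ch2, del_pos, ins_pos, d_len, max_len)


# Invented genes; lengths 9 and 6 become 5 and 8 (two inserted genes are duplicates).
c1, c2 = apply_insert_delete([1, 2, 3, 4, 5, 6, 7, 8, 9], [3, 4, 10, 11, 12, 13],
                             del_pos=2, ins_pos=6, d_len=4, max_len=10)
print(len(c1), len(c2))  # prints 5 8
```

Duplicates are removed with `dict.fromkeys`, which keeps the first occurrence of each gene and preserves order; the text does not say which copy of a duplicated gene is kept, so that choice is also an assumption.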
Because chromosome lengths change dynamically and insert and delete operations are frequent, we store chromosomes as dynamic linked lists. This storage scheme makes insertion and deletion fast, and the list length can vary dynamically.
C, mutation operator
The mutation operation on a chromosome proceeds as follows:
1. Compute the chromosome length Len.
2. Randomly generate a natural number C in [1, Len] as the number of mutation points.
3. Set c = 1.
4. Randomly generate a natural number in [1, Len] that was not drawn in a previous round, and use it as the mutation point.
5. Randomly generate a number r in [0, 1]. If r ≤ Pm, go to step 6; otherwise go directly to step 7. Here Pm is the mutation probability.
6. Randomly generate a natural number in [1, N] that does not already appear in the chromosome, and replace the parent's gene at the mutation point with this number.
7. Set c = c + 1.
8. If c > C, exit the mutation; otherwise go to step 4.
For example, given chromosome CH = {3, 6, 10, 17, 5, 8}, mutation points 2 and 6, and mutation values 9 and 11, the mutation operation is as shown in Figure 5.
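The mutation steps and the worked example can be sketched as below. This is an illustrative sketch, not the patent's code: `replace_genes` and `mutate` are hypothetical names, mutation points are taken to be 1-based positions because they are drawn from [1, Len], and drawing all C mutation points at once with `random.sample` is a shortcut that is assumed to be equivalent to drawing one non-repeating point per round.

```python
import random


def replace_genes(chromosome, points, values):
    """Replace the genes at 1-based mutation points with the given values."""
    child = list(chromosome)
    for point, value in zip(points, values):
        child[point - 1] = value
    return child


def mutate(chromosome, n_samples, p_m, rng=random):
    """Return a mutated copy; genes are sample numbers in [1, n_samples]."""
    child = list(chromosome)
    n_points = rng.randint(1, len(child))                     # C in [1, Len]
    points = rng.sample(range(1, len(child) + 1), n_points)   # distinct mutation points
    for point in points:
        if rng.random() > p_m:          # step 5: mutate only if r <= Pm
            continue
        used = set(child)
        unused = [g for g in range(1, n_samples + 1) if g not in used]
        if not unused:                  # every sample number is already a gene
            break
        child[point - 1] = rng.choice(unused)                 # step 6
    return child


# Worked example from the text: points 2 and 6, values 9 and 11.
print(replace_genes([3, 6, 10, 17, 5, 8], points=[2, 6], values=[9, 11]))
# prints [3, 9, 10, 17, 5, 11]
```

Because step 6 only draws numbers that are not yet in the chromosome, a mutated chromosome never contains the same cluster center twice.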
D, initialization of population
Because the chromosome length is variable, population initialization has some features of its own. The population is initialized as follows:
1. Set the population size Gsize.
2. Set I = 1.
3. If I ≤ Gsize, go to step 4; otherwise end the initialization.
4. Randomly set the chromosome length Len [formula not reproduced].
5. Randomly generate Len distinct natural numbers in [1, N] to form a chromosome Ind.
6. Check whether chromosome Ind already exists in the population. If it does, go to step 4; otherwise go to step 7.
7. Set I = I + 1.
8. Go to step 3.
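A compact sketch of this initialization loop follows. The allowed range of Len is not reproduced in the text, so `min_len` and `max_len` are hypothetical parameters; `init_population` is likewise an illustrative name. Two chromosomes are treated as identical only if they hold the same genes in the same order, which is an assumption about what "already exists in the population" means.

```python
import random


def init_population(g_size, n_samples, min_len, max_len, rng=random):
    """Build g_size distinct chromosomes of random length with distinct genes in [1, n_samples]."""
    population = []
    seen = set()
    while len(population) < g_size:                               # steps 3, 7, 8
        length = rng.randint(min_len, max_len)                    # step 4 (range is assumed)
        ind = tuple(rng.sample(range(1, n_samples + 1), length))  # step 5
        if ind in seen:                                           # step 6: already present, retry
            continue
        seen.add(ind)
        population.append(list(ind))
    return population


population = init_population(g_size=5, n_samples=30, min_len=2, max_len=5,
                             rng=random.Random(42))
for chromosome in population:
    print(chromosome)
```

The loop only terminates if at least `g_size` distinct chromosomes exist for the given parameters, which holds comfortably for realistic sample counts.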
E, fitness function
Because chromosomes use variable-length coding, the number of cluster centers is not fixed, so the fitness function differs from that used with fixed-length coding. It is defined as follows:
Here Len(Ind) is the chromosome length of individual Ind. The formula means the following: for each class, compute the distance from every sample in the class to the class center and sum these distances to obtain the value for that class. Then add 1 to the sum over all classes and take the reciprocal to obtain the fitness of chromosome Ind.
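The verbal definition corresponds to fitness(Ind) = 1 / (1 + sum over the Len(Ind) classes of the summed sample-to-center distances). The sketch below computes that quantity under assumptions that the text does not fix: distances are Euclidean, the class centers are the samples named by the genes (in the full algorithm K-Means may already have adjusted them), and each sample belongs to its nearest center. The function name `fitness` and the toy data are illustrative only.

```python
import math


def fitness(chromosome, samples):
    """1 / (1 + total distance from every sample to the center of its class)."""
    # Genes are 1-based sample numbers; assumed to name the class centers directly.
    centers = [samples[g - 1] for g in chromosome]
    total = 0.0
    for x in samples:
        # Assumption: a sample belongs to the class of its nearest center (Euclidean).
        total += min(math.dist(x, c) for c in centers)
    return 1.0 / (1.0 + total)


samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(fitness([1, 3], samples))   # two centers, total distance 2
print(fitness([1], samples))      # one center, much larger total distance
```

Adding 1 in the denominator keeps the fitness finite and at most 1 even when every sample coincides with a center.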
F, algorithm stopping criterion
The genetic algorithm stops when the number of generations exceeds the maximum number of generations GNUM, or when the population's average fitness still shows no significant change after several consecutive generations.
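The stopping criterion can be checked with a small helper such as the one below. The text gives neither the number of consecutive generations nor a threshold for "no significant change", so `patience` and `tol` are hypothetical parameters with arbitrary defaults, and `should_stop` is an illustrative name.

```python
def should_stop(avg_fitness_history, generation, gnum, patience=10, tol=1e-6):
    """Return True if the generation limit is exceeded or average fitness has stalled."""
    if generation > gnum:
        return True
    if len(avg_fitness_history) <= patience:
        return False
    # Assumption: "no significant change" means the spread of the average fitness
    # over the last patience + 1 generations is at most tol.
    window = avg_fitness_history[-(patience + 1):]
    return max(window) - min(window) <= tol


print(should_stop([0.10, 0.20, 0.30], generation=3, gnum=100))   # prints False
print(should_stop([0.25] * 11, generation=11, gnum=100))         # prints True
```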
Remarks:
Empirical value of the number of clusters: when a set of N samples is clustered, the number of clusters generally does not exceed [formula not reproduced]. This bound is the empirical value of the number of clusters.
Storing chromosomes as dynamic linked lists: each gene of a chromosome is one node of a dynamic linked list, and the chromosome is stored as that list. Because chromosomes differ in length, the linked lists that represent them differ in length as well, which reflects the variable-length chromosome coding of this patent.
Maximum number of generations GNUM: evolution generally stabilizes after a certain number of generations, but some evolutionary processes are slow. In that case a maximum number of generations GNUM is set so that prolonged computation does not hurt the efficiency of the algorithm.
This algorithm can be used for text clustering. The steps of text clustering are as follows:
1. Perform candidate feature word recognition on the document set to be clustered to obtain a candidate feature word set.
2. Filter the candidate feature word set and extract from it to obtain a new feature word set.
3. Represent every text in the document set as a text vector over the new feature word set.
4. Randomly select texts as cluster centers and use the text numbers as chromosome genes. Chromosomes of unequal length generated in this way form one population, and a second population is formed in the same way.
5. Evolve the two populations in parallel. Each population undergoes selection, crossover, and mutation. Then, for each individual in the population, text clustering is performed with the K-Means algorithm, the individual's fitness is computed, and the elite individuals are retained. After each generation, the elite individuals of the two populations are paired with each other and undergo inheritance, crossover, and mutation; the best individuals among the paired offspring are likewise retained as seeds for the next generation.
6. Check whether the stopping criterion has been reached. If it has, stop evolving and go to step 7; otherwise go to step 5.
7. Find the individual with the highest fitness in the two populations and use it as the initial cluster centers.
8. Run K-Means clustering on the document set from these initial cluster centers to obtain the final clustering result.
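Steps 7 and 8 amount to running ordinary K-Means from the centers named by the best chromosome. The sketch below shows only that final refinement, on plain numeric vectors and with Euclidean distance; both choices are assumptions, since the text does not state how text vectors are compared. `kmeans_from_seeds` and `max_iter` are hypothetical names, and the evolutionary loop of steps 4 to 6 is not reproduced here.

```python
import math


def kmeans_from_seeds(vectors, seed_numbers, max_iter=100):
    """K-Means started from the samples named by a chromosome (1-based numbers)."""
    centers = [list(vectors[g - 1]) for g in seed_numbers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(v, centers[k]))
            for v in vectors
        ]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:                   # keep the old center if a class is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels, centers = kmeans_from_seeds(vectors, seed_numbers=[1, 3])
print(labels)    # prints [0, 0, 1, 1]
print(centers)   # prints [[0.0, 0.5], [10.0, 0.5]]
```

The number of clusters is simply the length of the seed chromosome, which is how the variable-length coding carries the optimized K into the final clustering.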
Experimental setup and analysis of results:
To compare the key techniques we propose and to verify the feasibility of the proposed algorithm and those techniques, we carried out a large number of experiments. The experimental platform was Windows XP, the implementation was developed with Visual C++ 6.0, and parallel computation was simulated with multithreading. The experiments and the analysis of their results are described one by one below.
The experimental parameters were set as follows: number of parallel populations M = 2, population size m = 100, maximum number of generations GNUM = 100, crossover probability Pc = 0.86, mutation probability Pm = 0.02, and number of elite individuals Elite = 4.
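For readers who want to reproduce the setup, the parameters can be collected in one place, for example as a frozen dataclass. The class and field names below are hypothetical; only the values come from the text. The derived `elite_ratio` is included because the elite share of the population is what Experiment 3 varies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HPGAParams:
    """Experiment parameters as listed in the text (field names are illustrative)."""
    populations: int = 2          # M, number of parallel populations
    population_size: int = 100    # m
    max_generations: int = 100    # GNUM
    crossover_prob: float = 0.86  # Pc
    mutation_prob: float = 0.02   # Pm
    elite_count: int = 4          # Elite

    @property
    def elite_ratio(self) -> float:
        """Share of elite individuals in one population."""
        return self.elite_count / self.population_size


params = HPGAParams()
print(params.elite_ratio)  # prints 0.04
```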
1. Text clustering algorithm performance test
To test the performance of the text clustering scheme based on the hybrid parallel genetic algorithm (Hybrid Parallel Genetic Algorithm, HPGA), we designed the following experiments.
Experiment 1: algorithm stability and efficiency test
Six algorithms were compared: the K-Means clustering algorithm, the CFK-Means algorithm (which uses clustering features), the classical genetic algorithm (GA), the hybrid genetic algorithm HGA (K-Means + GA), the parallel genetic algorithm PGA, and the HPGA algorithm proposed here. Each was used to cluster 100 test documents (four classes, 25 documents per class) extracted from the State Language Work Committee's Modern Chinese corpus, and each algorithm was tested 50 times. The test results are shown in Table 1.
Table 1. Algorithm iteration counts and stability test results
The results show that K-Means is a fast clustering algorithm but has poor stability, which reflects its dependence on the initial cluster centers. CFK-Means is considerably more stable and needs fewer iterations on average, which makes it a good improved algorithm. GA is more stable in clustering than CFK-Means, but it needs more generations and a relatively long running time. PGA exploits the parallel nature of the genetic algorithm, which markedly improves computation speed and clustering stability, although there is still room for improvement. HGA is an efficient and stable genetic algorithm, but it does not make full use of the parallelism of the genetic algorithm and is therefore still less efficient than HPGA. HPGA stands out in computation speed and efficiency and is more stable still, because it combines the parallelism of the genetic algorithm with the efficiency of K-Means.
Experiment 2: clustering accuracy test
We extracted 155 documents from the State Language Work Committee's Modern Chinese corpus: 50 computer documents, 32 astronomy and geography documents, 40 law documents, and 33 energy and materials documents. Text clustering was performed with each of the six algorithms above. Clustering quality is evaluated by checking whether the class relationship between any two documents is consistent. The evaluation metric is the average accuracy, computed as follows:
aa = (pa + na) / 2 (7)
where na and pa are called the negative accuracy and the positive accuracy, computed as follows:
na = d / (b + d), pa = a / (a + c) (8)
Judged against the manual classification and against the automatic clustering, the relationship between any two documents falls into one of the four cases listed in Table 2:
Table 2. Class relationships between documents
The values a, b, c, and d in formula (8) are computed as follows: if the clustering result for a pair of documents falls into the first case, add 1 to a; if it falls into the second case, add 1 to b; if it falls into the third case, add 1 to c; if it falls into the fourth case, add 1 to d.
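Formulas (7) and (8) can be computed as sketched below. The contents of Table 2 are not reproduced in the text, so the mapping of the four cases to a, b, c, and d used in `pair_counts` is an assumption: a counts pairs that are together in both the manual classes and the clusters, b pairs that are manually separate but clustered together, c pairs that are manually together but clustered apart, and d pairs that are apart in both. The function names are hypothetical; `average_accuracy` itself follows the formulas as printed.

```python
from itertools import combinations


def pair_counts(manual, auto):
    """Count document pairs by agreement between manual classes and clusters."""
    a = b = c = d = 0
    for i, j in combinations(range(len(manual)), 2):
        same_manual = manual[i] == manual[j]
        same_auto = auto[i] == auto[j]
        if same_manual and same_auto:
            a += 1      # assumed first case
        elif same_auto:
            b += 1      # assumed second case: manually separate, clustered together
        elif same_manual:
            c += 1      # assumed third case: manually together, clustered apart
        else:
            d += 1      # assumed fourth case
    return a, b, c, d


def average_accuracy(a, b, c, d):
    """aa = (pa + na) / 2 with pa = a / (a + c) and na = d / (b + d)."""
    pa = a / (a + c)
    na = d / (b + d)
    return (pa + na) / 2


manual = ["law", "law", "energy", "energy"]
print(average_accuracy(*pair_counts(manual, [0, 0, 1, 1])))  # prints 1.0
print(average_accuracy(*pair_counts(manual, [0, 0, 0, 1])))  # prints 0.5
```

The helper divides by a + c and b + d without guarding against zero, so it expects at least one manually same-class pair and one manually different-class pair, as is the case for the document sets described here.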
The experimental results are shown in Table 3.
Table 3. Clustering accuracy test results
                    K-Means   CFK-Means   GA    PGA   HGA   HPGA
Average accuracy    72%       75%         75%   76%   90%   92%
The results of this experiment are consistent with those of Experiment 1. They show that K-Means has relatively low clustering accuracy; CFK-Means improves on K-Means to some extent; HGA performs relatively well among the genetic algorithms; and HPGA has the highest accuracy of the text clustering algorithms tested. The errors of HPGA come mainly from feature extraction and from the clustering of boundary documents.
Experiment 3: relationship between the elite-individual ratio and clustering accuracy
Using the same 155 documents as in Experiment 2, we varied the number of elite individuals and clustered the texts with the proposed HPGA clustering algorithm. The resulting relationship between the elite-individual ratio and clustering accuracy is shown in Figure 4.
Figure 4 shows that when elite individuals make up less than 8% of the population, clustering accuracy changes little and stays at a high level. However, when the number of elite individuals is too small, the algorithm converges too slowly, which hurts its efficiency. When elite individuals make up more than 8% of the population, clustering accuracy drops sharply. The reason is that a large share of elite individuals makes it hard to maintain diversity among the individuals in the population, so the algorithm tends to converge prematurely to a local optimum; the clustering result then deviates considerably, which lowers clustering accuracy.
The analysis above shows that the number of elite individuals chosen during clustering has a considerable effect on clustering accuracy. An elite share of 3% to 6% of the population size is generally suitable.
From the experiments and analysis above, we conclude that the variable-length chromosome coding hybrid parallel genetic algorithm proposed by the invention is an accurate and efficient algorithm for text clustering.
This algorithm can also be used in fields such as accident identification, image recognition, and the identification of defective industrial products, which are not described one by one here.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art to which the invention belongs may make various modifications or additions to the described embodiments, or substitute similar approaches, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Claims (3)

1. A hybrid parallel genetic clustering algorithm with variable-length chromosome coding, comprising the following steps:
A, variable length chromosome coding
Each gene of a chromosome is the number, within the sample set, of the sample point that corresponds to an initial cluster center. The coding form is: C = {c1, c2, …, ct}.
Here t is the coding length of a given chromosome, and t varies from one chromosome to another. ci (i = 1, 2, …, t) is the number within the sample set of the sample corresponding to the i-th cluster center; it is a natural number in [1, N], where N is the number of samples.
B, insert and delete crossover operator
The insert-delete crossover operator proceeds as follows:
1. Take parent individual CH1 as the chromosome to delete from and parent individual CH2 as the chromosome to insert into, and compute the lengths t1 and t2 of CH1 and CH2.
2. If [formula not reproduced], reselect chromosome CH2 until [formula not reproduced].
Here N is the number of samples and [formula not reproduced] is the empirical value of the number of clusters. This value is set only to speed up the algorithm; if higher accuracy is required, the bound can be relaxed appropriately. The requirement [formula not reproduced] prevents chromosome CH2 from being truncated for excess length after the insertion and thus ending up unchanged.
3. Randomly generate the insertion position Ins, the deletion position Del, and the length DLen of the segment to insert or delete.
The inserted length and the deleted length are equal, both DLen. The following conditions must hold:
0 ≤ Del < t1, 0 ≤ Ins ≤ t2, and DLen < t1
4. Starting at the deletion point, delete a gene segment of length DLen from chromosome CH1 to obtain child individual CH1', and insert the deleted segment into chromosome CH2 to obtain the intermediate chromosome CH2*.
5. Remove the duplicate genes from CH2* to obtain child individual CH2'.
6. If child individual CH2' is too long, truncate it.
C, mutation operator processing
The mutation operation on a chromosome proceeds as follows:
1. Compute the chromosome length Len.
2. Randomly generate a natural number C in [1, Len] as the number of mutation points.
3. Set c = 1.
4. Randomly generate a natural number in [1, Len] that was not drawn in a previous round, and use it as the mutation point.
5. Randomly generate a number r in [0, 1]. If r ≤ Pm, go to step 6; otherwise go directly to step 7. Here Pm is the mutation probability.
6. Randomly generate a natural number in [1, N] that does not already appear in the chromosome, and replace the parent's gene at the mutation point with this number.
7. Set c = c + 1.
8. If c > C, exit the mutation; otherwise go to step 4.
D, initialization of population
The population is initialized as follows:
1. Set the population size Gsize.
2. Set I = 1.
3. If I ≤ Gsize, go to step 4; otherwise end the initialization.
4. Randomly set the chromosome length Len [formula not reproduced].
5. Randomly generate Len distinct natural numbers in [1, N] to form a chromosome Ind.
6. Check whether chromosome Ind already exists in the population. If it does, go to step 4; otherwise go to step 7.
7. Set I = I + 1.
8. Go to step 3.
2. The hybrid parallel genetic clustering algorithm with variable-length chromosome coding according to claim 1, characterized in that the fitness function of said variable-length chromosome is as follows: [formula not reproduced]
3. The hybrid parallel genetic clustering algorithm with variable-length chromosome coding according to claim 1, characterized in that the stopping criterion of the algorithm is: the genetic algorithm stops when the number of generations exceeds the maximum number of generations GNUM, or when the population's average fitness still shows no change after several consecutive generations.
CN201710315280.9A 2017-05-08 2017-05-08 A kind of hybrid parallel genetic algorithm for clustering of variable length chromosome coding Pending CN107038479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710315280.9A CN107038479A (en) 2017-05-08 2017-05-08 A kind of hybrid parallel genetic algorithm for clustering of variable length chromosome coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710315280.9A CN107038479A (en) 2017-05-08 2017-05-08 A kind of hybrid parallel genetic algorithm for clustering of variable length chromosome coding

Publications (1)

Publication Number Publication Date
CN107038479A true CN107038479A (en) 2017-08-11

Family

ID=59537001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710315280.9A Pending CN107038479A (en) 2017-05-08 2017-05-08 A kind of hybrid parallel genetic algorithm for clustering of variable length chromosome coding

Country Status (1)

Country Link
CN (1) CN107038479A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376772A (en) * 2018-09-28 2019-02-22 武汉华喻燃能工程技术有限公司 A kind of Combination power load forecasting method based on neural network model
CN109376772B (en) * 2018-09-28 2021-02-23 武汉华喻燃能工程技术有限公司 Power load combination prediction method based on neural network model
CN109557300A (en) * 2019-01-17 2019-04-02 湖北中医药高等专科学校 A kind of full-automatic fluoroimmunoassay system and method
CN111209679A (en) * 2020-01-13 2020-05-29 广东工业大学 Genetic algorithm-based soil heavy metal content spatial interpolation method
CN111209679B (en) * 2020-01-13 2023-09-29 广东工业大学 Genetic algorithm-based spatial interpolation method for heavy metal content in soil
CN111414849A (en) * 2020-03-19 2020-07-14 四川大学 Face recognition method based on evolution convolutional neural network
CN112949859A (en) * 2021-04-16 2021-06-11 辽宁工程技术大学 Improved genetic clustering algorithm

Similar Documents

Publication Publication Date Title
CN107038479A (en) A kind of hybrid parallel genetic algorithm for clustering of variable length chromosome coding
Della Cioppa et al. Where are the niches? Dynamic fitness sharing
Lobato et al. Multi-objective genetic algorithm for missing data imputation
CN106649275A (en) Relation extraction method based on part-of-speech information and convolutional neural network
Qiao et al. An adaptive hybrid evolutionary immune multi-objective algorithm based on uniform distribution selection
CN104268629B (en) Complex network community detecting method based on prior information and network inherent information
CN106778826A (en) Based on the hereditary Hybrid Clustering Algorithm with preferred Fuzzy C average of self adaptation cellular
CN104239434A (en) Clustering method based on ecological niche genetic algorithm with diverse radius technology
Chang et al. A genetic clustering algorithm using a message-based similarity measure
CN109670037A (en) K-means Text Clustering Method based on topic model and rough set
Hruschka et al. Improving the efficiency of a clustering genetic algorithm
CN104463221A (en) Imbalance sample weighting method suitable for training of support vector machine
CN101324926A (en) Method for selecting characteristic facing to complicated mode classification
CN111079283A (en) Method for processing information saturation unbalanced data
CN109740722A (en) A kind of network representation learning method based on Memetic algorithm
CN106845696B (en) Intelligent optimization water resource configuration method
CN111209939A (en) SVM classification prediction method with intelligent parameter optimization module
CN114742593A (en) Logistics storage center optimal site selection method and system
Zhang et al. A novel method for detecting outlying subspaces in high-dimensional databases using genetic algorithm
Cheng et al. A projection-based split-and-merge clustering algorithm
Wang et al. Research and improvement on K-means clustering algorithm
Bo et al. An improved PAM algorithm for optimizing initial cluster center
CN112183598A (en) Feature selection method based on genetic algorithm
CN112085335A (en) Improved random forest algorithm for power distribution network fault prediction
Liu et al. Distributed database query based on improved genetic algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170811