CN107038479A - A hybrid parallel genetic clustering algorithm with variable-length chromosome encoding
- Publication number
- CN107038479A (application CN201710315280.9A)
- Authority
- CN
- China
- Prior art keywords
- chromosome
- algorithm
- length
- cluster
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
Abstract
The invention provides a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding, belonging to the technical field of data analysis and processing. It addresses three technical problems of existing clustering algorithms: the number of clusters is difficult to determine, the choice of initial cluster centers affects the clustering result, and clustering efficiency and accuracy are low. The algorithm comprises variable-length chromosome encoding, an insert-delete crossover operator, a mutation operator, population initialization, and the design of a fitness function. The invention offers high clustering efficiency and accuracy and a wide range of applicability.
Description
Technical Field
The invention belongs to the technical field of data analysis and processing and relates to a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding.
Background
1. The K-Means algorithm
The traditional K-Means algorithm is an unsupervised learning algorithm for which the number of clusters is known in advance. Its basic idea is as follows: the number of classes K is specified, and the samples are clustered. Starting from K randomly selected cluster centers, the samples are partitioned by the minimum-distance principle and the cluster centers are updated iteratively, so that the iteration moves toward the minimum of the objective function and the best clustering is obtained.
K-Means generally uses formula (1) as the objective function to decide whether the algorithm should terminate.
Here DIS(X_i, Z_j) is calculated by formula (2), K is the number of clusters, X_i is a sample belonging to class C_j, and Z_j is the center of that class. The objective function is the sum of the distances from the samples of every class to their class center.
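The formula images are not reproduced in this text. A reconstruction of formula (1) that is consistent with the definitions above is given below. The Euclidean form shown for formula (2) is an assumption, since the text does not state which distance DIS denotes; d is the assumed number of feature dimensions, and x_{im} and z_{jm} are the m-th coordinates of X_i and Z_j.

```latex
J = \sum_{j=1}^{K} \sum_{X_i \in C_j} \mathrm{DIS}(X_i, Z_j) \tag{1}

\mathrm{DIS}(X_i, Z_j) = \sqrt{\sum_{m=1}^{d} \left( x_{im} - z_{jm} \right)^2} \tag{2}
```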
The traditional K-Means algorithm proceeds as follows:
1. Take a data set X of size n.
2. Choose K initial cluster centers Z_j (j = 1, 2, ..., K).
3. Using the Z_j as reference points, partition X by the nearest-neighbor principle, assigning each sample to a cluster: if X_i and Z_j satisfy formula (3), X_i belongs to class j.
4. Adjust the cluster centers according to formula (4), where z_ij is the value of the j-th dimension of the i-th center, n_i is the number of sample points in class C_i, X_k is a sample point belonging to class C_i, and x_kj is the value of the j-th dimension of X_k.
5. Compute the value J of the objective function by formula (1).
6. If J changes little over several consecutive iterations, stop; otherwise go to step 3.
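The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the patent's implementation: the Euclidean distance, the function names `dis` and `k_means`, and the parameters `tol`, `patience`, and `max_iter` are all assumptions introduced here. Step 3 is read as "assign to the nearest center" and step 4 as "move each center to the mean of its members", which is what the definitions of n_i and x_kj suggest.

```python
import math


def dis(x, z):
    """Euclidean distance (assumed form of DIS)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))


def k_means(samples, centers, tol=1e-9, patience=3, max_iter=100):
    """Run steps 3-6 from the given initial centers.

    Returns (centers, labels, J). tol, patience and max_iter are
    illustrative parameters, not values from the source.
    """
    centers = [list(c) for c in centers]
    prev_j, stable = None, 0
    labels, j_value = [], 0.0
    for _ in range(max_iter):
        # Step 3: assign each sample to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: dis(x, centers[j]))
                  for x in samples]
        # Step 4: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:  # an empty class keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Step 5: objective J is the sum of sample-to-center distances.
        j_value = sum(dis(x, centers[lab]) for x, lab in zip(samples, labels))
        # Step 6: stop once J has barely changed for several rounds.
        if prev_j is not None and abs(prev_j - j_value) <= tol:
            stable += 1
            if stable >= patience:
                break
        else:
            stable = 0
        prev_j = j_value
    return centers, labels, j_value
```

Calling `k_means(samples, initial_centers)` with different `initial_centers` on the same data is a simple way to observe the sensitivity to initialization discussed in this section.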
K-Means clustering is an important method in data mining and knowledge discovery. It is simple, has strong local search ability, and converges quickly, which makes it well suited to clustering high-dimensional vectors.
However, when K-Means is used, different choices of the number of clusters and of the initial cluster centers have a considerable influence on the clustering result.
To address the sensitivity of K-Means to the initial cluster centers, many improvements have been proposed. They focus mainly on how the initial centers are selected and on making good use of the clustering structure.
In K-Means the initial cluster centers are chosen at random. Such a choice often forces samples from the same class to serve as centers of different classes, so the clustering deviates from the true structure. To select initial centers sensibly, researchers have carried out many experiments and analyses and have optimized the selection with a variety of methods. The simplest measure is to run the algorithm several times with different random initial values and keep the best result. Others have proposed separating the cluster mean point from the cluster seed: when the seed for the next round is computed, only those data in the cluster that are most similar to the seed of the previous round are used, and their mean point (geometric center) becomes the seed of the next round. Another proposal is KADD, an improved algorithm based on density and object direction, which determines the initial centers from the distribution density of the objects to be clustered and then finds clusters of arbitrary shape according to the clustering direction of the objects.
In addition, analysis of clustering models has shown that clustering problems generally have a statistical property called the clustering feature. Representing clusters by clustering features retains more clustering information, which helps to improve clustering quality, and the parameters needed during clustering (such as the cluster center, the sum of squared deviations, and the cluster radius) can be computed directly from the clustering features. Multiple sampling is one method that clusters with clustering features, and CFK-Means is a typical clustering algorithm of this kind.
Experiments confirm that these improved K-Means algorithms perform much better than the traditional algorithm, but they generally improve only the local optimization of K-Means and still do not strengthen its global search ability.
Combining K-Means clustering with the parallel genetic algorithm discussed below would greatly benefit the global optimization ability of the algorithm and would also allow the clustering parameters to be optimized, greatly improving performance. These are the problems addressed in the remainder of this document.
2. Selection of initial cluster centers for K-Means
The biggest problem of K-Means is the selection of the initial cluster centers. If the centers are chosen correctly and then optimized, the precision of the algorithm improves greatly.
The main methods currently used to select initial cluster centers are the following:
1. Randomly select K samples as the initial cluster centers.
This method is the simplest, and it is also the most likely to trap the algorithm in a local optimum.
2. Choose K representative samples as the initial cluster centers on the basis of experience.
This method requires a fairly deep understanding of the characteristics and basic structure of the samples. In many practical problems, however, there is almost no way to learn the characteristics and basic structure of the sample data, and in that case the method cannot be used.
3. Select initial cluster centers and cluster several times, then pick the best set of initial centers.
This method is simple and easy to apply, but when the sample set is large the number of possible combinations is enormous, and testing all of them consumes a great deal of machine time. The method is therefore suitable only for small sample sets.
4. Following statistical principles, sample several times and cluster a second time to obtain the initial cluster centers.
This method clusters each of several drawn subsamples to produce several groups of cluster centers, clusters these centers again, and compares the results to obtain the best initial centers.
Because the optimization operates on very small subsets of the given samples, it needs far less memory than operating on the whole sample set and is suitable for large-scale clustering problems. However, it yields only a "suboptimal" clustering result, and the result depends on how the subsets are drawn.
5. Density-based selection of initial cluster centers.
This method selects representative points as initial cluster centers according to sample density. First, with each sample object as the center and a given positive number R as the radius, a spherical neighborhood is marked out in feature space, and the number of objects falling inside it is taken as the density of that point. The object with the highest density is chosen as the first initial cluster center; it corresponds to the highest peak of the object distribution density. Then, given a positive number D, the point with the next-highest density that lies at least D away from the first center is selected as the second representative point, which prevents the representative points from being too concentrated. Continuing in the same way yields K initial cluster centers.
The method requires the density radius R and the minimum distance D to be determined. These two values should differ between sample sets if clustering accuracy is to be guaranteed.
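The density-based selection of method 5 can be sketched as below. This is an illustrative reading, not code from the source: the function name `density_centers`, the Euclidean distance, the tie-breaking by index, and the requirement that each new center lie farther than D from all previously chosen centers (the text states it only for the first) are assumptions.

```python
import math


def density_centers(samples, k, radius, min_dist):
    """Pick up to k sample indices as initial centers by local density.

    radius plays the role of R and min_dist the role of D in the text.
    Requiring distance from all earlier centers is an assumption.
    """
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    # Density of a point: number of other samples within the radius.
    density = [sum(1 for j, y in enumerate(samples)
                   if j != i and dist(x, y) <= radius)
               for i, x in enumerate(samples)]
    # Visit candidates from densest to sparsest (ties: lower index first).
    order = sorted(range(len(samples)), key=lambda i: (-density[i], i))
    chosen = []
    for i in order:
        if all(dist(samples[i], samples[c]) > min_dist for c in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen
```

The function returns fewer than k indices when no remaining sample is far enough from the centers already chosen, which is one way the dependence on R and D shows up in practice.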
6. Optimize the initial cluster centers with a genetic algorithm or an immune programming algorithm.
This kind of method uses the global random search of genetic and immune programming algorithms to optimize the initial cluster centers. The centers obtained this way describe the characteristics of the clusters relatively accurately, and the approach is widely used.
Each of the above methods has its own advantages and disadvantages. On the whole, optimization of the initial cluster centers by genetic and immune programming algorithms works comparatively well. However, these algorithms are prone to local premature convergence, and they do not exploit their inherent parallelism. In view of this, we propose optimizing the initial cluster centers with a parallel genetic algorithm and, on that basis, form a hybrid parallel genetic algorithm that solves the initial cluster center selection problem.
3. Parallel genetic algorithm
Parallel genetic algorithms include an individual migration strategy called "marriage"; a parallel genetic algorithm that uses it is called a parallel genetic algorithm based on the "marriage" strategy.
Such an algorithm imitates human marriage customs and tries to prevent individuals with the same gene structure from mating, so as to avoid premature convergence. M (M ≥ 2) subpopulations evolve in parallel. When the marriage condition between populations is met, the best individuals of the current generation in different populations are married in pairs, and the best individual among the offspring is copied to the source populations concerned. To preserve good genes during evolution, an elitist strategy can be used: the best offspring of the marriage is compared with the individuals of the source population, and the best individual is kept as a seed for the next generation. The model is shown in Fig. 1.
Because the offspring of a marriage carry genes from other populations, the marriage strategy maintains gene diversity within a population and thereby effectively avoids the harm of inbreeding. At the same time, because good genes from other populations are introduced, it accelerates the search.
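One possible reading of the marriage step is sketched below. The text does not say how the best offspring enters a source population, so replacing the weakest resident only when the offspring is fitter is an assumption made here to illustrate elitist retention. The function name `marry` and the callback signatures `fitness(ind)` and `crossover(a, b)` are likewise hypothetical.

```python
def marry(populations, fitness, crossover):
    """Marry the best individuals of every pair of subpopulations.

    fitness(ind) -> float and crossover(a, b) -> list of children are
    caller-supplied (hypothetical signatures). Each population is a
    list and is modified in place.
    """
    # Best individual of the current generation in each subpopulation.
    best = [max(pop, key=fitness) for pop in populations]
    for a in range(len(populations)):
        for b in range(a + 1, len(populations)):
            children = crossover(best[a], best[b])
            if not children:
                continue
            child = max(children, key=fitness)
            for idx in (a, b):
                pop = populations[idx]
                worst = min(range(len(pop)), key=lambda i: fitness(pop[i]))
                # Elitist retention (assumed form): the child enters only
                # if it is fitter than the resident it would replace.
                if fitness(child) > fitness(pop[worst]):
                    pop[worst] = child
    return populations
```

The check for the marriage condition itself is left to the caller, since the text does not define it.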
4. Clustering method based on a variable-length-chromosome hybrid parallel genetic algorithm
Because K-Means uses a heuristic when computing the cluster centers, it effectively reduces algorithmic complexity and improves speed. For the same reason, however, it is sensitive to the choice of initial cluster centers and is easily trapped in a local optimum.
To avoid this defect, we propose a clustering method based on a hybrid parallel genetic algorithm (Hybrid Parallel Genetic Algorithm, HPGA). The model is shown in Fig. 2. The algorithm combines the efficiency and local search ability of K-Means with the global optimization ability of the parallel genetic algorithm. Through inheritance and mutation within populations, and parallel evolution and marriage between populations, it provides higher efficiency and accuracy for sample clustering.
Summary of the Invention
The purpose of the invention is to address the above problems of the prior art by providing a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding. The technical problem to be solved is how to obtain the number of clusters and the initial cluster centers adaptively during genetic evolution while improving the efficiency and accuracy of clustering.
The invention proposes a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding that improves the global search ability and the precision of K-Means, as follows:
The parallel genetic algorithm is an effective way to solve the premature convergence problem of genetic algorithms. It makes full use of the parallelism of the genetic algorithm, greatly improves efficiency, and preserves accuracy. However, this gain in efficiency comes only from parallelism; it does not consider using a heuristic for local optimization during the computation.
K-Means is a clustering algorithm with strong local search ability. It uses a heuristic to find the cluster center points during the computation, so it is very efficient. However, its global search ability is poor and it is sensitive to the choice of initial cluster centers, so its precision cannot be guaranteed.
Combining the two brings out the advantages of both and yields an effective solution to the initial cluster center selection problem.
The hybrid parallel genetic algorithm exploits the efficiency and local search ability of K-Means together with the parallelism and global optimization ability of the parallel genetic algorithm, and so finds the initial cluster centers quickly and accurately.
The object of the invention is achieved by the following technical solution, a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding:
A. Variable-length chromosome encoding
The basic idea of the traditional K-Means algorithm is as follows: with the value of K fixed, first find K initial cluster centers, then assign each sample point to a class by the nearest-neighbor principle, and finally adjust the center coordinates of each class. The adjustment of the center coordinates is iterated until the stopping condition on the objective function is met.
In practical clustering problems, however, K is very difficult to determine and can only be set from experience, which reduces the accuracy of the algorithm.
A close analysis of K-Means and the parallel genetic algorithm shows that, in K-Means clustering based on a parallel genetic algorithm, the role of the parallel genetic algorithm is essentially to find the chromosome with the highest fitness, and the fitness calculation does not depend on the chromosome length (the number of clusters). In the hybrid model, K-Means mainly partitions the samples into classes and adjusts the cluster centers. This process does not conflict with the chromosome length either; different chromosomes simply produce different partitions.
On this basis we propose a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding in which K changes dynamically. The algorithm clusters the samples and at the same time obtains an optimized number of clusters dynamically, so the accuracy of clustering is necessarily improved.
In variable-length chromosome encoding, each gene of a chromosome is the index, within the sample set, of the sample point that serves as an initial cluster center. The encoding has the form C = {c_1, c_2, ..., c_t},
where t is the encoded length of the chromosome and varies from chromosome to chromosome, and c_i (i = 1, 2, ..., t) is the index in the sample set of the sample corresponding to the i-th cluster center, a natural number in [1, N] (N is the number of samples).
For example, if one set of initial cluster centers consists of sample 3, sample 7, sample 10, and sample 19, its chromosome is encoded as C_1 = {3, 7, 10, 19}. If another set of initial cluster centers consists of sample 2, sample 10, sample 15, sample 18, and sample 20, its chromosome is encoded as C_2 = {2, 10, 15, 18, 20}.
The two chromosomes clearly differ in length: C_1 has length 4 and encodes 4 cluster centers, while C_2 has length 5 and encodes 5 cluster centers.
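To make the encoding concrete, the short sketch below decodes the two example chromosomes into initial cluster centers. The toy one-dimensional sample set and the helper name `decode` are assumptions for illustration only; the sample numbers are 1-based, as in the text.

```python
def decode(chromosome, samples):
    """Map 1-based sample numbers to the sample points they name."""
    return [samples[c - 1] for c in chromosome]


# 20 one-dimensional toy samples, numbered 1..20 (illustrative data).
samples = [(float(n),) for n in range(1, 21)]
c1 = [3, 7, 10, 19]
c2 = [2, 10, 15, 18, 20]
centers1 = decode(c1, samples)  # 4 initial cluster centers
centers2 = decode(c2, samples)  # 5 initial cluster centers
```

Because the chromosome stores sample numbers rather than coordinates, its length alone determines the number of clusters K.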
B. Insert-delete crossover operator
We designed an insert-delete crossover operator specifically to accommodate changes in chromosome length during parallel genetic evolution.
The idea of the operator is to delete a segment of genes from one chromosome and insert that segment at some position in another chromosome.
The steps of the insert-delete crossover operator are as follows:
1. Take parent individual CH_1 as the chromosome to delete from and parent individual CH_2 as the chromosome to insert into, and compute the lengths t_1 and t_2 of CH_1 and CH_2.
2. If CH_2 does not satisfy the length condition, reselect chromosome CH_2 until it does. The condition is stated in terms of N, the number of samples, and an empirical value for the number of clusters; the formula itself is missing from this text. The empirical value is set only to speed up the algorithm; if higher accuracy is required, it can be relaxed. The condition on CH_2 prevents the case in which, after insertion, CH_2 is truncated for being too long and ends up unchanged.
3. Randomly generate the insertion position Ins, the deletion position Del, and the length DLen of the segment to insert or delete. The inserted and deleted lengths are equal, both DLen, and the following conditions must hold:
0 ≤ Del < t_1, 0 ≤ Ins ≤ t_2, and DLen < t_1
4. Starting at the deletion point, delete a gene segment of length DLen from CH_1 to obtain child individual CH_1', and insert the deleted segment into CH_2 to obtain the intermediate individual CH_2*.
5. Remove duplicate genes from CH_2* to obtain child individual CH_2'.
6. If CH_2' is too long, truncate it.
After the insert-delete operation the lengths of the chromosomes have changed. This change maintains chromosome diversity during genetic evolution and benefits the optimization and search of the genetic algorithm.
Because chromosome length changes dynamically and insert-delete operations are frequent, we store chromosomes as dynamic linked lists. This storage makes insertion and deletion fast, and the list length can change dynamically.
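Steps 3 to 6 of the operator can be sketched as follows. Several details are assumptions rather than statements from the source: `max_len` is a hypothetical stand-in for the length limit whose formula is missing, the deleted segment is kept entirely inside CH_1, duplicates are removed by keeping the first occurrence, and Python lists are used in place of the linked lists mentioned in the text. The reselection of CH_2 in step 2 is left to the caller.

```python
import random


def insert_delete_crossover(ch1, ch2, max_len, rng=random):
    """Return children (ch1', ch2') of the insert-delete crossover.

    max_len is a hypothetical stand-in for the length limit whose
    formula is missing from the text. Requires len(ch1) >= 2.
    """
    t1, t2 = len(ch1), len(ch2)
    # Step 3: 0 <= Del < t1, 0 <= Ins <= t2, 1 <= DLen < t1.
    d_len = rng.randint(1, t1 - 1)
    delete_at = rng.randint(0, t1 - d_len)  # segment stays inside ch1 (assumption)
    insert_at = rng.randint(0, t2)
    # Step 4: cut the segment out of ch1 and splice it into ch2.
    segment = ch1[delete_at:delete_at + d_len]
    child1 = ch1[:delete_at] + ch1[delete_at + d_len:]
    spliced = ch2[:insert_at] + segment + ch2[insert_at:]
    # Step 5: drop repeated genes, keeping the first occurrence.
    child2 = list(dict.fromkeys(spliced))
    # Step 6: truncate an over-long child.
    return child1, child2[:max_len]
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing parameter settings.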
C. Mutation operator
The mutation of a chromosome proceeds as follows:
1. Compute the chromosome length Len.
2. Randomly generate a natural number C in [1, Len] as the number of mutation points.
3. Set c = 1.
4. Randomly generate a natural number in [1, Len] that does not repeat an earlier round, as the mutation point.
5. Randomly generate a number r in [0, 1]. If r ≤ P_m, go to step 6; otherwise go directly to step 7. P_m is the mutation probability.
6. Randomly generate a natural number in [1, N] that does not occur in the chromosome, and replace the parent's gene at the mutation point with it.
7. Set c = c + 1.
8. If c > C, stop mutating; otherwise go to step 4.
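The mutation steps can be sketched as below. The loop over c is replaced by drawing all C distinct mutation points at once, which is equivalent under the reading that a point may not repeat within one mutation; that reading, the function name `mutate`, and the guard for the case where every sample number is already in use are assumptions.

```python
import random


def mutate(chromosome, n_samples, p_m, rng=random):
    """Return a mutated copy of a chromosome of 1-based sample numbers."""
    child = list(chromosome)
    length = len(child)
    count = rng.randint(1, length)              # step 2: number of points C
    points = rng.sample(range(length), count)   # step 4: distinct points
    for point in points:
        if rng.random() <= p_m:                 # step 5: mutate with probability P_m
            unused = [g for g in range(1, n_samples + 1) if g not in child]
            if unused:                          # nothing to draw if all are used
                child[point] = rng.choice(unused)   # step 6
    return child
```

Drawing the replacement from sample numbers not already in the chromosome keeps every gene unique, so a mutated chromosome never names the same cluster center twice.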
D. Population initialization
Because chromosome length is variable, population initialization has some features of its own. The steps are as follows:
1. Set the population size Gsize.
2. Set I = 1.
3. If I ≤ Gsize, go to step 4; otherwise end initialization.
4. Set the chromosome length Len at random.
5. Randomly generate Len distinct natural numbers in [1, N] to form a chromosome Ind.
6. Check whether Ind already exists in the population. If it does, go to step 4; otherwise go to step 7.
7. Set I = I + 1.
8. Go to step 3.
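A minimal sketch of the initialization loop follows. The bounds `min_len` and `max_len` are hypothetical parameters, since the range from which Len is drawn is not reproduced in the text; the function name `init_population` is also an assumption.

```python
import random


def init_population(g_size, n_samples, min_len, max_len, rng=random):
    """Build g_size distinct chromosomes of random length.

    min_len and max_len are hypothetical bounds; the source's range
    for Len is not reproduced in the text.
    """
    population = []
    while len(population) < g_size:                         # steps 3, 7, 8
        length = rng.randint(min_len, max_len)              # step 4
        ind = rng.sample(range(1, n_samples + 1), length)   # step 5
        if ind not in population:                           # step 6
            population.append(ind)
    return population
```

The loop terminates only if at least `g_size` distinct chromosomes exist for the given bounds, so very small values of `n_samples` combined with a large population size should be avoided.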
E. Fitness function
Because chromosomes use variable-length encoding, the number of cluster centers is not fixed, so the fitness function differs from that of fixed-length encoding. It is defined as follows:
where Len(Ind) is the chromosome length of individual Ind. The formula means the following: for each class, compute the distances from the samples in the class to the class center and sum them to obtain the fitness term of that class. Add 1 to the sum of the terms of all classes and take the reciprocal to obtain the fitness of chromosome Ind.
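The formula image is missing from this text. The reconstruction below follows the verbal description term by term. The symbol f for the fitness is introduced here, and Z_j, C_j, and DIS are assumed to carry the same meaning as in the Background (class center, class, and distance).

```latex
f(\mathrm{Ind}) = \frac{1}{1 + \sum_{j=1}^{\mathrm{Len}(\mathrm{Ind})} \sum_{X_i \in C_j} \mathrm{DIS}(X_i, Z_j)}
```

Under this form the fitness lies in (0, 1] and grows as the total within-class distance shrinks; the added 1 keeps the denominator nonzero when every sample coincides with its class center.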
F. Stopping criterion
The genetic algorithm stops when the number of generations exceeds the maximum GNUM, or when the average fitness of the population shows no significant change over several consecutive generations.
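The criterion can be sketched as a small check run once per generation. The text gives no numbers for "several consecutive generations" or "no significant change", so the parameters `window` and `eps` and their defaults are assumptions, as is the function name `should_stop`.

```python
def should_stop(avg_fitness_history, g_num, window=5, eps=1e-6):
    """Decide whether evolution should stop.

    avg_fitness_history holds one population-average fitness value per
    generation. window and eps are illustrative; the source gives no
    numeric values.
    """
    generation = len(avg_fitness_history)
    if generation > g_num:          # exceeded the maximum generation GNUM
        return True
    if generation >= window:        # average fitness has plateaued
        recent = avg_fitness_history[-window:]
        return max(recent) - min(recent) <= eps
    return False
```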
Brief Description of the Drawings
Fig. 1 shows the two-population marriage parallel genetic algorithm model described in the Background.
Fig. 2 shows the hybrid parallel genetic algorithm model of the invention, based on two-population marriage.
Fig. 3 is a schematic diagram of the chromosome insert-delete operation in the embodiment.
Fig. 4 shows the relationship between the proportion of elite individuals and clustering accuracy in the embodiment.
Fig. 5 is a schematic diagram of an example of the mutation operator in the embodiment.
Detailed Description
The following specific embodiments, together with the drawings, further describe the technical solution of the invention, but the invention is not limited to these embodiments.
As shown in Fig. 2, the invention proposes a hybrid parallel genetic algorithm that improves the global search ability and the precision of K-Means, as follows:
The parallel genetic algorithm is an effective way to solve the premature convergence problem of genetic algorithms. It makes full use of the parallelism of the genetic algorithm, greatly improves efficiency, and preserves accuracy. However, this gain in efficiency comes only from parallelism; it does not consider using a heuristic for local optimization during the computation.
K-Means is a clustering algorithm with strong local search ability. It uses a heuristic to find the cluster center points during the computation, so it is very efficient. However, its global search ability is poor and it is sensitive to the choice of initial cluster centers, so its precision cannot be guaranteed.
Combining the two brings out the advantages of both and yields an effective solution to the initial cluster center selection problem.
The hybrid parallel genetic algorithm exploits the efficiency and local search ability of K-Means together with the parallelism and global optimization ability of the parallel genetic algorithm, and so finds the initial cluster centers quickly and accurately.
A. Variable-length chromosome encoding
In clustering problems the number of cluster centers is difficult to determine and can only be set from experience. A number chosen this way can bias the clustering result, so we use the parallel genetic algorithm to determine the number of cluster centers dynamically.
Each gene of a chromosome is the index, within the sample set, of the sample point that serves as an initial cluster center. The encoding has the form C = {c_1, c_2, ..., c_t},
where t is the encoded length of the chromosome and varies from chromosome to chromosome, and c_i (i = 1, 2, ..., t) is the index in the sample set of the sample corresponding to the i-th cluster center, a natural number in [1, N] (N is the number of samples).
For example, if one set of initial cluster centers consists of sample 3, sample 7, sample 10, and sample 19, its chromosome is encoded as C_1 = {3, 7, 10, 19}. If another set of initial cluster centers consists of sample 2, sample 10, sample 15, sample 18, and sample 20, its chromosome is encoded as C_2 = {2, 10, 15, 18, 20}.
The two chromosomes clearly differ in length: C_1 has length 4 and encodes 4 cluster centers, while C_2 has length 5 and encodes 5 cluster centers.
The basic idea of the traditional K-Means algorithm is as follows: with the value of K fixed, first find K initial cluster centers, then assign each sample point to a class by the nearest-neighbor principle, and finally adjust the center coordinates of each class. The adjustment of the center coordinates is iterated until the stopping condition on the objective function is met.
In practical clustering problems, however, K is very difficult to determine and can only be set from experience, which reduces the accuracy of the algorithm.
A close analysis of K-Means and the parallel genetic algorithm shows that, in K-Means clustering based on a parallel genetic algorithm, the role of the parallel genetic algorithm is essentially to find the chromosome with the highest fitness, and the fitness calculation does not depend on the chromosome length (the number of clusters). In the hybrid model, K-Means mainly partitions the samples into classes and adjusts the cluster centers. This process does not conflict with the chromosome length either; different chromosomes simply produce different partitions.
On this basis we propose a hybrid parallel genetic clustering algorithm with variable-length chromosome encoding in which K changes dynamically. The algorithm clusters the samples and at the same time obtains an optimized number of clusters, so the accuracy of clustering is necessarily improved.
B. Insert-delete crossover operator
As shown in Fig. 3, we designed an insert-delete crossover operator to accommodate changes in chromosome length during parallel genetic evolution.
The idea of the operator is to delete a segment of genes from one chromosome and insert that segment at some position in another chromosome.
The steps of the insert-delete crossover operator are as follows:
1. Take parent individual CH_1 as the chromosome to delete from and parent individual CH_2 as the chromosome to insert into, and compute the lengths t_1 and t_2 of CH_1 and CH_2.
2. If CH_2 does not satisfy the length condition, reselect chromosome CH_2 until it does. The condition is stated in terms of N, the number of samples, and an empirical value for the number of clusters; the formula itself is missing from this text. The empirical value is set only to speed up the algorithm; if higher accuracy is required, it can be relaxed. The condition on CH_2 prevents the case in which, after insertion, CH_2 is truncated for being too long and ends up unchanged.
3. Randomly generate the insertion position Ins, the deletion position Del, and the length DLen of the segment to insert or delete. The inserted and deleted lengths are equal, both DLen, and the following conditions must hold:
0 ≤ Del < t_1, 0 ≤ Ins ≤ t_2, and DLen < t_1
4. Starting at the deletion point, delete a gene segment of length DLen from CH_1 to obtain child individual CH_1', and insert the deleted segment into CH_2 to obtain the intermediate individual CH_2*.
5. Remove duplicate genes from CH_2* to obtain child individual CH_2'.
6. If CH_2' is too long, truncate it.
As Fig. 3 shows, after the insert-delete operation the chromosome lengths change from 9 and 6 to 5 and 8 respectively. This change maintains chromosome diversity during genetic evolution and benefits the optimization and search of the genetic algorithm.
Because chromosome length changes dynamically and insert-delete operations are frequent, we store chromosomes as dynamic linked lists. This storage makes insertion and deletion fast, and the list length can change dynamically.
C, mutation operator
The mutation steps for a chromosome are as follows:
1. Compute the chromosome length Len;
2. Randomly generate a natural number C in [1, Len] as the number of mutation points;
3. Set c = 1;
4. Randomly generate a natural number in [1, Len] that does not repeat a point chosen in an earlier round, and use it as the mutation point;
5. Randomly generate a number r in [0, 1]; if r ≤ Pm, go to step 6, otherwise go directly to step 7, where Pm is the mutation probability;
6. Randomly generate a natural number in [1, N] that does not already occur in the chromosome, and replace the parent individual's gene at the mutation point with this number;
7. Set c = c + 1;
8. If c > C, exit mutation; otherwise go to step 4.
For example, given the chromosome CH = {3, 6, 10, 17, 5, 8}, mutation points 2 and 6, and mutation values 9 and 11, the mutation operation is as shown in Figure 5.
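The mutation steps and the worked example can be sketched as follows. This is an illustrative sketch only: the names `mutate` and `apply_mutations` are hypothetical, chromosomes are held as Python lists, and mutation points are drawn in one call to `sample` rather than one per round, which gives the same non-repeating points.

```python
import random


def mutate(chromosome, n_samples, pm, rng=random):
    """Return a mutated copy; genes are unique sample numbers in [1, N]."""
    child = list(chromosome)
    length = len(child)
    count = rng.randint(1, length)              # step 2: number of points C
    points = rng.sample(range(length), count)   # step 4: distinct 0-based points
    for point in points:
        if rng.random() <= pm:                  # step 5: mutate with probability Pm
            used = set(child)
            unused = [g for g in range(1, n_samples + 1) if g not in used]
            if unused:                          # step 6: pick a gene not yet present
                child[point] = rng.choice(unused)
    return child


def apply_mutations(chromosome, points, values):
    """Deterministic helper: set the genes at 1-based points to values."""
    child = list(chromosome)
    for point, value in zip(points, values):
        child[point - 1] = value
    return child


# The example from the text: points 2 and 6 receive the values 9 and 11.
print(apply_mutations([3, 6, 10, 17, 5, 8], [2, 6], [9, 11]))
# prints [3, 9, 10, 17, 5, 11]
```

Because each replacement gene is drawn from the numbers not yet in the chromosome, the child keeps the parent's length and never contains a repeated gene.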
D, initialization of population
Because chromosome length is variable, population initialization has some characteristics of its own. The steps for initializing a population of chromosomes are as follows:
1. Set the population size Gsize;
2. Set I = 1;
3. If I ≤ Gsize, go to step 4; otherwise end initialization;
4. Randomly set the chromosome length Len;
5. Randomly generate Len distinct natural numbers in [1, N] to form a chromosome Ind;
6. Check whether chromosome Ind already exists in the population; if it does, go to step 4, otherwise go to step 7;
7. Set I = I + 1;
8. Go to step 3.
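The initialization loop can be sketched as below. The range from which Len is drawn is not recoverable from the text, so the sketch assumes lengths between a hypothetical `min_len` of 2 and `floor(sqrt(N))`; the function name `init_population` and the choice to treat two chromosomes as identical only when their gene sequences match exactly are also assumptions.

```python
import math
import random


def init_population(gsize, n_samples, min_len=2, rng=random):
    """Build gsize distinct chromosomes of random length."""
    max_len = math.isqrt(n_samples)  # assumed upper bound on Len
    if max_len < min_len:
        raise ValueError("n_samples is too small for the assumed length range")
    population = []
    seen = set()
    while len(population) < gsize:                  # steps 3, 7, 8
        length = rng.randint(min_len, max_len)      # step 4 (assumed range)
        # Step 5: Len distinct sample numbers from [1, N].
        ind = tuple(rng.sample(range(1, n_samples + 1), length))
        if ind in seen:                             # step 6: reject duplicates
            continue
        seen.add(ind)
        population.append(list(ind))
    return population
```

For example, `init_population(100, 100)` would build a population matching the experimental population size m = 100, with chromosome lengths between 2 and 10. The loop only terminates if at least `gsize` distinct chromosomes exist, which holds easily for these values.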
E, fitness function
Because chromosomes use variable-length coding, the number of cluster centers is not fixed, so the fitness function differs from that used with fixed-length coding. It is defined as follows:
$$f(Ind) = \frac{1}{1 + \sum_{i=1}^{Len(Ind)} \sum_{x \in C_i} d(x, z_i)}$$
where Len(Ind) is the chromosome length of individual Ind, $C_i$ is the set of samples in class i, $z_i$ is the center of class i, and $d(x, z_i)$ is the distance from sample x to that center. The formula means: for each class, compute the distance from every sample in the class to the class center and sum these distances to obtain the value for that class; then add 1 to the sum over all classes and take the reciprocal to obtain the fitness of chromosome Ind.
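A minimal sketch of this fitness computation follows. It assumes Euclidean distance, assumes each sample is assigned to its nearest center, and uses the samples named by the genes directly as class centers without the K-Means pass that the text clustering procedure applies to each individual; the function name `fitness` is hypothetical.

```python
import math


def fitness(chromosome, samples):
    """Fitness of one chromosome.

    samples: list of equal-length numeric vectors.
    chromosome: 1-based sample numbers used as cluster centers.
    """
    centers = [samples[g - 1] for g in chromosome]
    total = 0.0
    for x in samples:
        # Distance from x to the center of the class it is assigned to.
        total += min(math.dist(x, z) for z in centers)
    # Add 1 to the summed within-class distances and take the reciprocal.
    return 1.0 / (1.0 + total)
```

The added 1 in the denominator keeps the value finite when every sample coincides with its center, in which case the fitness is exactly 1.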
F, algorithm stopping criterion
The genetic algorithm stops when the number of generations exceeds the maximum generation count GNUM, or when the population's average fitness shows no obvious change over many consecutive generations.
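The stopping test can be sketched as a small predicate. The text does not say how many generations count as "many" or how small a change counts as "no obvious change", so the parameters `patience` and `tol` below, like the function name `should_stop`, are assumptions chosen for illustration.

```python
def should_stop(generation, avg_fitness_history, gnum=100, patience=10, tol=1e-6):
    """Return True when evolution should stop.

    generation: index of the current generation.
    avg_fitness_history: population average fitness, one value per generation.
    """
    # Criterion 1: the generation count exceeds the maximum GNUM.
    if generation > gnum:
        return True
    # Criterion 2: average fitness stayed within tol over the last
    # `patience` generation-to-generation steps (assumed interpretation).
    if len(avg_fitness_history) <= patience:
        return False
    recent = avg_fitness_history[-(patience + 1):]
    return max(recent) - min(recent) < tol
```

A caller would append the population's average fitness after each generation and check `should_stop` before starting the next one.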
Remarks:
Empirical value of the number of clusters: when a sample set of N samples is clustered, the number of clusters generally does not exceed $\sqrt{N}$. This $\sqrt{N}$ is the empirical value of the number of clusters.
Storing chromosomes as dynamic linked lists: each gene of a chromosome is one node of a dynamic linked list, and the chromosome is stored as the list. Because chromosomes differ in length, the linked lists that represent them also differ in length, which reflects the variable-length chromosome coding of this patent.
Maximum generation count GNUM: evolution generally stabilizes after a certain number of generations, but some evolutionary processes are slow. GNUM is set for such cases so that prolonged computation does not hurt the algorithm's efficiency.
This algorithm can be used to cluster text. The text clustering steps are as follows:
1. Identify candidate feature words in the document set to be clustered, obtaining the candidate feature word set;
2. Filter and extract from the candidate feature word set to obtain the new feature word set;
3. Represent every text in the document set as a text vector over the new feature word set;
4. Randomly select texts as cluster centers, using the text numbers as chromosome genes. In this way, form a population of chromosomes of unequal length, and form a second population in the same way;
5. Evolve the two populations in parallel. Each population separately undergoes selection, crossover, and mutation; then, for each individual in a population, cluster the texts with the K-Means algorithm, compute the individual's fitness, and retain the elite individuals. After each generation, pair the elite individuals of the two populations two by two and subject them to genetic operations (crossover and mutation); the best individuals produced by this pairing are likewise retained as seeds for the next generation;
6. Check whether evolution has reached the stopping criterion; if so, stop evolving and go to step 7, otherwise go to step 5;
7. Find the individual with the highest fitness in the two populations and use it as the initial cluster centers;
8. Run K-Means clustering on the document set from these initial cluster centers to obtain the final clustering result.
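Steps 7 and 8 can be sketched as below: pick the fittest individual across both populations, then run a plain K-Means pass seeded with the samples it names. This is an illustration under assumptions, not the patent's code: the names `best_individual` and `kmeans_from_chromosome` are hypothetical, text vectors are plain numeric tuples, distance is Euclidean, and the fitness function is passed in by the caller.

```python
import math


def best_individual(pop_a, pop_b, fitness_fn):
    """Step 7: the fittest chromosome across both populations."""
    return max(pop_a + pop_b, key=fitness_fn)


def kmeans_from_chromosome(chromosome, samples, max_iter=100):
    """Step 8: K-Means seeded by the samples the chromosome names.

    Returns (labels, centers); labels[i] is the cluster index of samples[i].
    """
    centers = [list(samples[g - 1]) for g in chromosome]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))
            for x in samples
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [x for x, lab in zip(samples, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Seeding K-Means from an evolved chromosome is what removes its dependence on a random choice of initial centers, which is the weakness the experiments attribute to plain K-Means.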
Experimental setup and analysis of results:
To compare the key techniques we propose and to verify the feasibility of the proposed algorithm and techniques, we carried out a large number of experiments. The experimental platform is Windows XP; the software was developed with Visual C++ 6.0, and parallel computation was simulated with multithreading. The experiments and the analysis of their results are described one by one below.
The experimental parameters were set as follows: number of parallel populations M = 2, population size m = 100, maximum number of generations GNUM = 100, crossover probability Pc = 0.86, mutation probability Pm = 0.02, number of elite individuals Elite = 4.
1. Text clustering algorithm performance test
To test the performance of the text clustering scheme based on the hybrid parallel genetic algorithm (Hybrid Parallel Genetic Algorithm, HPGA), we designed the following experiments.
Experiment 1: algorithm stability and efficiency test
The K-Means clustering algorithm, the CFK-Means algorithm using clustering features, the classical genetic algorithm GA, the hybrid genetic algorithm HGA (K-Means + GA), the parallel genetic algorithm PGA, and the HPGA algorithm proposed here were each used to cluster 100 test documents (four classes, 25 per class) drawn from the State Language Work Committee's Modern Chinese corpus. Each algorithm was tested 50 times. The test results are shown in Table 1.
Table 1. Algorithm iteration counts and stability test results
The results show that K-Means is a fast clustering algorithm but is unstable, which reflects its dependence on the initial cluster centers. CFK-Means is considerably more stable and has a smaller mean iteration count, making it a fairly good improved algorithm. GA is more stable than CFK-Means in clustering, but it needs many generations and a relatively long run time. PGA exploits the parallel nature of genetic algorithms, so its speed and clustering stability improve significantly, though there is still room for improvement. HGA is an efficient and stable genetic algorithm, but it does not make full use of the parallelism of genetic algorithms, so it is still less efficient than HPGA. HPGA stands out in computing speed and efficiency and is more stable still, because it combines the parallelism of genetic algorithms with the efficiency of K-Means.
Experiment 2: clustering accuracy test
We drew 155 documents from the State Language Work Committee's Modern Chinese corpus: 50 computer documents, 32 astronomy and geography documents, 40 law documents, and 33 energy and materials documents. The six algorithms above were each used to cluster the texts. Clustering quality is evaluated by checking whether the class-membership relation between any two documents is consistent. The evaluation index is the average accuracy, computed as follows:
$$\mathrm{Aa} = \frac{\mathrm{pa} + \mathrm{na}}{2} \qquad (7)$$
where na and pa are the negative accuracy and the positive accuracy, respectively, computed as follows:
$$\mathrm{na} = \frac{d}{b + d}, \quad \mathrm{pa} = \frac{a}{a + c} \qquad (8)$$
According to the manual classification standard and the automatic clustering standard, the relation between any two documents falls into one of the four cases listed in Table 2:
Table 2. Class-membership relations between document pairs
The values a, b, c, and d in formula (8) are counted as follows: if a document pair falls into the first case, add 1 to a; if it falls into the second case, add 1 to b; if it falls into the third case, add 1 to c; if it falls into the fourth case, add 1 to d.
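Formulas (7) and (8) reduce to a few lines of arithmetic once the four pair counts are known. The sketch below takes the counts as inputs rather than deriving them, since the four cases are defined by Table 2; the function name `average_accuracy` and the zero-denominator handling are assumptions for the example.

```python
import math


def average_accuracy(a, b, c, d):
    """Average accuracy Aa from the four document-pair counts."""
    pa = a / (a + c) if (a + c) else 0.0   # positive accuracy, formula (8)
    na = d / (b + d) if (b + d) else 0.0   # negative accuracy, formula (8)
    return (pa + na) / 2                   # formula (7)


# Every unordered document pair is counted once, so for the 155
# documents of Experiment 2 the counts sum to C(155, 2).
total_pairs = math.comb(155, 2)
print(total_pairs)  # prints 11935
```

A quick consistency check on any evaluation run is therefore that a + b + c + d equals the number of unordered document pairs.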
The experimental results are shown in Table 3.
Table 3. Clustering accuracy test results
Algorithm | K-Means | CFK-Means | GA | PGA | HGA | HPGA |
---|---|---|---|---|---|---|
Average accuracy | 72% | 75% | 75% | 76% | 90% | 92% |
These results are consistent with those of Experiment 1. K-Means has relatively low clustering accuracy; CFK-Means improves somewhat on K-Means; among the genetic algorithms, HGA performs relatively well; and HPGA has the highest accuracy of the text clustering algorithms tested. The errors of HPGA come mainly from feature extraction and from the clustering of boundary documents.
Experiment 3: relation between the proportion of elite individuals and clustering accuracy
Again using the 155 documents from Experiment 2, we varied the number of elite individuals and clustered the texts with the proposed HPGA clustering algorithm, obtaining the relation between the proportion of elite individuals and clustering accuracy shown in Figure 4.
Figure 4 shows that when elite individuals make up less than 8% of the population, clustering accuracy changes little and stays at a high level. However, when there are too few elite individuals, the algorithm converges too slowly, which hurts efficiency. When elite individuals make up more than 8% of the population, clustering accuracy drops sharply. The reason is that when elite individuals form a large share of the population, it is hard to maintain individual diversity; the algorithm tends to converge prematurely to a local optimum, so the clustering result deviates substantially and clustering accuracy suffers.
The analysis above shows that the number of elite individuals chosen has a large influence on clustering precision. An elite share of 3% to 6% of the population size is generally suitable.
From the experiments and analysis above, we conclude that the variable-length chromosome coding hybrid parallel genetic algorithm proposed by the invention is accurate and efficient for text clustering.
The algorithm can also be used in fields such as accident identification, image recognition, and identification of defective industrial products, which are not described one by one here.
The specific embodiments described here merely illustrate the spirit of the invention. Those skilled in the art to which the invention belongs may make various modifications or additions to the described embodiments, or substitute similar approaches, without departing from the spirit of the invention or exceeding the scope defined in the appended claims.
Claims (3)
1. A hybrid parallel genetic clustering algorithm with variable-length chromosome coding, comprising the following steps:
A, variable length chromosome coding
A gene of a chromosome is represented by the number, within the sample set, of the sample point corresponding to an initial cluster center. The coding form is C = {c1, c2, …, ct},
where t is the code length of a given chromosome and varies from chromosome to chromosome, and ci (i = 1, 2, …, t) is the number within the sample set of the sample corresponding to the i-th cluster center, a natural number in [1, N] (N is the number of samples).
B, insert and delete crossover operator
The steps of the chromosome insert-delete crossover operator are as follows:
1. Take parent individual CH1 as the chromosome to delete from and parent individual CH2 as the chromosome to insert into, and compute the lengths $t_1$ and $t_2$ of CH1 and CH2;
2. If $t_2 \ge \sqrt{N}$, reselect chromosome CH2 until $t_2 < \sqrt{N}$;
Here N is the number of samples and $\sqrt{N}$ is the empirical value of the number of clusters. This value is set only to speed up the algorithm; if higher accuracy is required, the bound can be relaxed appropriately. The requirement $t_2 < \sqrt{N}$ prevents the case in which CH2 is already so long that the inserted genes are truncated away and CH2 is left unchanged.
3. Randomly generate the insertion point Ins, the deletion point Del, and the length DLen of the inserted or deleted segment;
The inserted length and the deleted length are equal, both DLen. These values must satisfy:
$0 \le Del < t_1$, $0 \le Ins \le t_2$, and $DLen < t_1$
4. Starting at the deletion point, delete a gene segment of length DLen from chromosome CH1 to obtain child individual CH1′, and insert the deleted segment into chromosome CH2 to obtain the intermediate chromosome CH2*;
5. Remove the duplicate genes from CH2* to obtain child individual CH2′;
6. If child individual CH2′ is too long, truncate it.
C, mutation operator processing
The mutation steps for a chromosome are as follows:
1. Compute the chromosome length Len;
2. Randomly generate a natural number C in [1, Len] as the number of mutation points;
3. Set c = 1;
4. Randomly generate a natural number in [1, Len] that does not repeat a point chosen in an earlier round, and use it as the mutation point;
5. Randomly generate a number r in [0, 1]; if r ≤ Pm, go to step 6, otherwise go directly to step 7, where Pm is the mutation probability;
6. Randomly generate a natural number in [1, N] that does not already occur in the chromosome, and replace the parent individual's gene at the mutation point with this number;
7. Set c = c + 1;
8. If c > C, exit mutation; otherwise go to step 4.
D, initialization of population
The steps for initializing a population of chromosomes are as follows:
1. Set the population size Gsize;
2. Set I = 1;
3. If I ≤ Gsize, go to step 4; otherwise end initialization;
4. Randomly set the chromosome length Len;
5. Randomly generate Len distinct natural numbers in [1, N] to form a chromosome Ind;
6. Check whether chromosome Ind already exists in the population; if it does, go to step 4, otherwise go to step 7;
7. Set I = I + 1;
8. Go to step 3.
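2. The hybrid parallel genetic clustering algorithm with variable-length chromosome coding according to claim 1, characterized in that the fitness function of said variable-length chromosome is as follows:
$$f(Ind) = \frac{1}{1 + \sum_{i=1}^{Len(Ind)} \sum_{x \in C_i} d(x, z_i)}$$
where Len(Ind) is the chromosome length of individual Ind, $C_i$ is the set of samples in class i, $z_i$ is the center of class i, and $d(x, z_i)$ is the distance from sample x to that center.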
3. The hybrid parallel genetic clustering algorithm with variable-length chromosome coding according to claim 1, characterized in that the stopping criterion of the algorithm is: the genetic algorithm stops when the number of generations exceeds the maximum generation count GNUM, or when the population's average fitness still shows no change after many consecutive generations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710315280.9A CN107038479A (en) | 2017-05-08 | 2017-05-08 | A kind of hybrid parallel genetic algorithm for clustering of variable length chromosome coding |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107038479A true CN107038479A (en) | 2017-08-11 |
Family
ID=59537001
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710315280.9A Pending CN107038479A (en) | 2017-05-08 | 2017-05-08 | A kind of hybrid parallel genetic algorithm for clustering of variable length chromosome coding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107038479A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109376772A (en) * | 2018-09-28 | 2019-02-22 | 武汉华喻燃能工程技术有限公司 | A kind of Combination power load forecasting method based on neural network model |
CN109376772B (en) * | 2018-09-28 | 2021-02-23 | 武汉华喻燃能工程技术有限公司 | Power load combination prediction method based on neural network model |
CN109557300A (en) * | 2019-01-17 | 2019-04-02 | 湖北中医药高等专科学校 | A kind of full-automatic fluoroimmunoassay system and method |
CN111209679A (en) * | 2020-01-13 | 2020-05-29 | 广东工业大学 | Genetic algorithm-based soil heavy metal content spatial interpolation method |
CN111209679B (en) * | 2020-01-13 | 2023-09-29 | 广东工业大学 | Genetic algorithm-based spatial interpolation method for heavy metal content in soil |
CN111414849A (en) * | 2020-03-19 | 2020-07-14 | 四川大学 | Face recognition method based on evolution convolutional neural network |
CN112949859A (en) * | 2021-04-16 | 2021-06-11 | 辽宁工程技术大学 | Improved genetic clustering algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20170811 |