A simple genetic algorithm for finding consensus binding sites in DNA sequences in Drosophila

tbrunetti/ConSeGA

ConSeGA - Consensus Sequence-based Genetic Algorithm


A genetic algorithm used to identify regions of regularity in the genome of Drosophila melanogaster. This algorithm has been tested on previously published ChIP data for the transcription factor Myocyte-enhancer factor 2 (MEF2).

  1. Introduction and Theory

  2. How does it work?

  3. Software Dependencies

Introduction

Genetic algorithms (GAs) are algorithms that apply the basic principles of Darwinian evolution to a computational problem of interest. This approach generates results by applying the idea of "survival of the fittest".

  1. Chromosomes store genetic information
  2. Individuals in a population that are deemed “most genetically fit” prevail
  3. Those that do not either die off or undergo mutation/crossover
  4. Those that are the most elite continue their lineage in the next generation
  5. This process is repeated for several generations, until the population evolves and equilibrium is reached
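
The five steps above can be sketched on a toy problem. This is only an illustration of the general pattern, not ConSeGA's code: the problem (maximizing the number of 1s in a bit string), the function names, and every parameter value are assumptions chosen for brevity.

```python
import random


def fitness(chromosome):
    # Toy fitness (assumption): the number of 1 bits in the chromosome.
    return sum(chromosome)


def crossover(a, b, rng):
    # Single-point crossover: the head of one parent joined to the tail of the other.
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chromosome, rng, rate=0.05):
    # Flip each bit independently with probability `rate`.
    return [1 - g if rng.random() < rate else g for g in chromosome]


def evolve(pop_size=30, length=20, generations=60, n_elite=2, seed=0):
    rng = random.Random(seed)
    # 1. Chromosomes store the genetic information.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # 2. Rank individuals by fitness; the fittest prevail.
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # 4. Elitism: the best individuals are copied unchanged.
        elite = population[:n_elite]
        # 3. The rest are replaced by mutated crossovers of the fitter half.
        parents = population[:pop_size // 2]
        children = []
        while len(children) < pop_size - n_elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        # 5. Repeat for several generations.
        population = elite + children
    return max(population, key=fitness), history
```

Because the elite are carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next.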

These five concepts can be applied programmatically to many scenarios in order to solve complex problems. Some examples of where genetic algorithms have been very successful are (but not limited to):

What is the theory behind ConSeGA and how does it work?

ConSeGA is a genetic algorithm approach for detecting consensus sequences in biological data sets.

  • The algorithm defines the set of "individuals" or "chromosomes" as a set of strings, each 15 characters long and composed only of A, T, G, and C. For this algorithm, the strings were derived from MEF2 ChIP data.

  • The fitness function, which determines how "fit" an individual is, is defined by the window of strings whose 10 consecutive entropy values have the lowest sum. Here is the entropy sliding window calculation:

It is then followed by the overall sum of entropies sliding window calculation:

  • The most fit window of strings is automatically excluded from mutation.

  • Any string that is not in the most fit window is subject to random mutation. A mutation, in this case, is a random position shift in the alignment array of +/-3 base pairs.

  • Elitism is used to ensure the most fit window of strings is automatically incorporated into the next generation of string alignments.

  • This process is repeated or "evolved" for 3000 generations.
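
The steps above can be sketched roughly as follows. This is one possible reading, not the repository's implementation, and it rests on several assumptions: the per-column entropy is taken to be Shannon entropy in bits, an alignment is represented as one integer offset per string, only windows in which every string contributes a base are scored, and elitism is modelled by keeping the best alignment and replacing it only when a mutated copy scores strictly lower. It does not reproduce the exact rule for excluding the most fit window from mutation. All function names are hypothetical, and the sketch uses only the standard library rather than NumPy.

```python
import math
import random
from collections import Counter

BASES = "ATGC"
WINDOW = 10     # consecutive columns summed per window (from the text)
MAX_SHIFT = 3   # a mutation shifts a string by up to +/-3 positions (from the text)


def column_entropy(column):
    """Shannon entropy in bits of one alignment column (assumed definition)."""
    counts = Counter(b for b in column if b in BASES)
    total = float(sum(counts.values()))
    if total == 0:
        return 2.0  # assumption: an empty column counts as maximally uncertain
    return -sum((n / total) * math.log(n / total, 2) for n in counts.values())


def build_rows(seqs, offsets, pad):
    """Pad each string with '-' so that it starts at its offset in the array."""
    return ["-" * off + s + "-" * (2 * pad - off) for s, off in zip(seqs, offsets)]


def best_window(rows, width=WINDOW):
    """Return (lowest entropy sum, start column) over all gap-free windows."""
    columns = list(zip(*rows))
    entropies = [column_entropy(c) for c in columns]
    best_score, best_start = float("inf"), None
    for start in range(len(columns) - width + 1):
        if any("-" in c for c in columns[start:start + width]):
            continue  # assumption: skip windows that contain padding
        score = sum(entropies[start:start + width])
        if score < best_score:
            best_score, best_start = score, start
    return best_score, best_start


def mutate(offsets, pad, rng):
    """Shift one randomly chosen string by a non-zero amount within +/-MAX_SHIFT."""
    new = list(offsets)
    i = rng.randrange(len(new))
    shift = rng.choice([s for s in range(-MAX_SHIFT, MAX_SHIFT + 1) if s != 0])
    new[i] = min(max(new[i] + shift, 0), 2 * pad)  # keep the string inside the array
    return new


def evolve(seqs, generations=3000, pop_size=20, pad=MAX_SHIFT, seed=0):
    """Elitist search over alignments; a lower entropy sum means a fitter alignment."""
    rng = random.Random(seed)

    def score(offsets):
        return best_window(build_rows(seqs, offsets, pad))[0]

    elite = [pad] * len(seqs)  # start with every string centred in the array
    elite_score = score(elite)
    for _ in range(generations):
        candidates = [mutate(elite, pad, rng) for _ in range(pop_size)]
        challenger = min(candidates, key=score)
        challenger_score = score(challenger)
        if challenger_score < elite_score:  # elitism: never accept a worse alignment
            elite, elite_score = challenger, challenger_score
    return elite, elite_score
```

Calling `evolve(seqs)` on a list of equal-length strings returns the best offsets found and their window score. A score of 0 would mean that every column in the best window holds a single base across all strings, which is a perfect consensus under these assumptions.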

Software Dependencies

ConSeGA was written in Python version 2.7. The following Python dependencies are required:

  • NumPy
  • Matplotlib

Running ConSeGA

The main Python script to run ConSeGA is tf_genetic_algorithm_multFunc_version2.py. It requires an input text file of "chromosomes", which are k-mer sequences. Command-line arguments for setting the number of generations, array size, k-mer length, and input file location are planned; for the moment, these values are hardcoded to mimic what we have validated. The input file can be changed at line 23 of the code, and the k-mer size, array size, and number of generations can also be updated in the code before running:

python2.7 tf_genetic_algorithm_multFunc_version2.py
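
Before editing the hardcoded input path, it can help to check that the input file is well formed. The sketch below assumes the file holds one k-mer per line, which is a guess rather than a format stated in this README. `load_kmers` is a hypothetical helper and is not part of ConSeGA.

```python
def load_kmers(path, k=15):
    """Read k-mers from a text file, assuming one sequence per line."""
    kmers = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, 1):
            seq = line.strip().upper()
            if not seq:
                continue  # ignore blank lines
            # Reject anything that is not exactly k characters drawn from A, T, G, C.
            if len(seq) != k or set(seq) - set("ATGC"):
                raise ValueError("line %d: expected %d bases from ATGC, got %r"
                                 % (line_number, k, seq))
            kmers.append(seq)
    return kmers
```

The default `k=15` matches the string length described above. Pass a different `k` if the k-mer size is changed in the code.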
