
Multiple Sequence Alignment Comparison Statistic - Fractionalization


yongze-yin/Fract-Calculator


Fractionalization is a measure commonly used in socioeconomics to quantify how a society is divided into different groups along a given dimension, such as race, religion, or language. We apply this concept to genome alignment comparison: this script takes two whole genome alignment files as input and outputs fractionalization values that quantify how congruent the two alignment files are.

A fractionalization value is generated for each multiple sequence alignment, calculated as

$$ F = 1 - \sum_{i=1}^n s_i^2 $$

where $F$ is the fractionalization value, $n$ is the number of alignments in the other alignment file across which the target alignment is distributed, and $s_i$ is the overlap of the $i$-th of those alignments with the target alignment, normalized by the length of the target alignment. A fractionalization value ranges from 0 to 1. It can be interpreted, roughly, as the probability that two base pairs drawn at random from the target alignment fall into different alignments in the other alignment file. In other words, the closer the fractionalization values are to 0, the more congruent the two input multiple sequence alignment files are.
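The formula above can be illustrated with a minimal Python sketch. This is not the script's actual implementation: the function names `overlap_ratios` and `fractionalization` are hypothetical, and the sketch assumes a simplified setting in which each alignment is a single half-open interval `(start, end)` on one sequence and the overlapping alignments do not overlap each other.

```python
def overlap_ratios(target, others):
    """Return s_i for each interval in `others` that overlaps `target`.

    Intervals are half-open (start, end) pairs on one sequence; this is a
    simplifying assumption made for illustration only.
    """
    t_start, t_end = target
    length = t_end - t_start
    if length <= 0:
        raise ValueError("target interval must have positive length")
    ratios = []
    for start, end in others:
        overlap = min(t_end, end) - max(t_start, start)
        if overlap > 0:
            # Normalize the overlap by the length of the target alignment.
            ratios.append(overlap / length)
    return ratios


def fractionalization(ratios):
    """Return F = 1 - sum(s_i^2) for the overlap ratios of one target."""
    if any(s < 0 or s > 1 for s in ratios):
        raise ValueError("each overlap ratio must lie in [0, 1]")
    return 1.0 - sum(s * s for s in ratios)


target = (0, 100)

# Target fully covered by one alignment in the other file: fully congruent.
print(fractionalization(overlap_ratios(target, [(-10, 120)])))  # 0.0

# Target split evenly across two alignments.
print(fractionalization(overlap_ratios(target, [(0, 50), (50, 100)])))  # 0.5

# Target split evenly across four alignments.
quarters = [(0, 25), (25, 50), (50, 75), (75, 100)]
print(fractionalization(overlap_ratios(target, quarters)))  # 0.75
```

The more pieces the target alignment is split into, the closer $F$ moves toward 1, which matches the interpretation that low values indicate congruent alignments.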


[Figure: Fract Illustration]

[Figure: Fract Example]

The script currently supports the maf (Cactus, SibeliaZ), xmfa (Mauve), and mhg (MHG/MHG-EVO) file formats. Cactus produces its output in hal format; to calculate fractionalization values for it, first convert the hal output to maf format using hal2maf, which is provided with Cactus. For each input alignment file, the script writes a tsv file containing the length of each alignment, the number of sequences in each alignment, and the fractionalization value of each alignment when compared against the other alignment file.

usage: frac_calculate [-h] -g GENOME -a ALIGNMENTA -b ALIGNMENTB -at {mhg,mauve,cactus,sibeliaz} -bt
                      {mhg,mauve,cactus,sibeliaz} [-gb GENOME_BED_OUTPUT] [-pa PREFIXA] [-pb PREFIXB] [-t THRESHOLD]

Fractionalization Calculation to Compare Two Whole Genome Alignment Results

optional arguments:
  -h, --help            show this help message and exit
  -g GENOME, --genome GENOME
                        A directory of input fasta genomes for whole genome aligners, will read in all files ended
                        with '.fna', '.fa', '.fasta'
  -a ALIGNMENTA, --alignmentA ALIGNMENTA
                        Path to the 1st alignment file
  -b ALIGNMENTB, --alignmentB ALIGNMENTB
                        Path to the 2nd alignment file
  -at {mhg,mauve,cactus,sibeliaz}, --AType {mhg,mauve,cactus,sibeliaz}
                        File format of the 1st alignment file; 'txt' for mhg/mhg-evo; 'xmfa' for mauve; 'maf' for
                        cactus/sibeliaz
  -bt {mhg,mauve,cactus,sibeliaz}, --BType {mhg,mauve,cactus,sibeliaz}
                        File format of the 2nd alignment file; 'txt' for mhg/mhg-evo; 'xmfa' for mauve; 'maf' for
                        cactus/sibeliaz
  -gb GENOME_BED_OUTPUT, --genome_bed_output GENOME_BED_OUTPUT
                        Output path for genome bed file recording all contig length, default to 'genome.bed'
  -pa PREFIXA, --prefixA PREFIXA
                        Prefix for the 1st alignment file, default is 'A'
  -pb PREFIXB, --prefixB PREFIXB
                        Prefix for the 2nd alignment file, default is 'B'
  -t THRESHOLD, --threshold THRESHOLD
                        If an alignment is shorter than the threshold, its fractionalization value will not be
                        reported to avoid an over-representation for the total fractionalization values being too high
                        or too low. If it is needed to consider all alignments no matter for those being short, set
                        this value to 0
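As an illustration of how the options above fit together, the following Python sketch assembles a command line from the documented flags. The helper `build_command` and all file and directory names are hypothetical, and the executable name `frac_calculate` is taken from the usage line; the actual script name in the repository may differ.

```python
import shlex

# Choices listed for -at/-bt in the usage text.
ALLOWED_TYPES = {"mhg", "mauve", "cactus", "sibeliaz"}


def build_command(genome_dir, aln_a, type_a, aln_b, type_b, threshold=None):
    """Assemble an argument list using only flags shown in the usage text."""
    for aln_type in (type_a, type_b):
        if aln_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported alignment type: {aln_type}")
    cmd = [
        "frac_calculate",  # name as printed in the usage line (assumption)
        "-g", genome_dir,
        "-a", aln_a,
        "-b", aln_b,
        "-at", type_a,
        "-bt", type_b,
    ]
    if threshold is not None:
        # -t 0 reports every alignment, including short ones.
        cmd += ["-t", str(threshold)]
    return cmd


# Hypothetical inputs: a Cactus maf file compared against a Mauve xmfa file.
cmd = build_command("genomes/", "cactus_out.maf", "cactus",
                    "mauve_out.xmfa", "mauve", threshold=0)
print(shlex.join(cmd))
# frac_calculate -g genomes/ -a cactus_out.maf -b mauve_out.xmfa -at cactus -bt mauve -t 0
```

The resulting list could be passed to `subprocess.run` once the script is installed and the paths point to real files.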
