This is an implementation of prefix-free parsing, a technique originally proposed in:
@article{boucher2019prefix,
title={Prefix-free parsing for building big BWTs},
author={Boucher, Christina and Gagie, Travis and Kuhnle, Alan and Langmead, Ben and Manzini, Giovanni and Mun, Taher},
journal={Algorithms for Molecular Biology},
volume={14},
number={1},
pages={1--15},
year={2019},
publisher={BioMed Central}
}
This implementation has been optimized and accepts Variant Call Format (VCF) files in place of FASTA files. The publication corresponding to this work is:
@inproceedings{oliva2022csts,
title={CSTs for Terabyte-Sized Data},
author={Oliva, Marco and Cenzato, Davide and Rossi, Massimiliano and Lipt{\'a}k, Zsuzsanna and Gagie, Travis and Boucher, Christina},
booktitle={2022 Data Compression Conference (DCC)},
pages={93--102},
year={2022},
organization={IEEE}
}
This work was supported by NIH R01AI141810 and is publicly available under a GNU license. If you use any part of this repository, please cite the publications above and this repository.
This tool produces the same result as running bigbwt on the FASTA file generated as follows:
cat reference.fa | bcftools consensus calls.vcf.gz -H 1 > consensus.fa
Symbolic alleles (e.g. <CN1>) are currently not supported.
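As a rough sketch of the direct route, the same VCF and reference can be passed to pfp++ itself, using flags from its help text. The file names and the output prefix `consensus_pfp` are hypothetical, and the `run_pfp` wrapper and `DRY_RUN` variable exist only for this illustration. By default the script prints the command instead of running it.

```shell
# Hypothetical file names; replace with your own inputs.
ref="reference.fa"
vcf="calls.vcf.gz"
out_prefix="consensus_pfp"

# Print the command by default; set DRY_RUN=0 to actually execute it.
DRY_RUN="${DRY_RUN:-1}"

# Illustrative wrapper (not part of pfp++): runs or prints the command.
run_pfp() {
    if [ "$DRY_RUN" = "0" ]; then
        pfp++ "$@"
    else
        printf 'pfp++ %s\n' "$*"
    fi
}

# -v (VCF), -r (reference), -H (haplotype) and -o (output prefix)
# are taken from the pfp++ help text.
run_pfp -v "$vcf" -r "$ref" -H 1 -o "$out_prefix"
```

The `-H 1` here mirrors the `-H 1` passed to bcftools consensus in the pipeline above.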
PFP is available on bioconda:
conda install -c bioconda -c conda-forge pfp
pfp++ --help
PFP is available on docker:
docker pull moliva3/pfp:latest
docker run moliva3/pfp:latest pfp++ --help
If using singularity:
singularity pull pfp_sif docker://moliva3/pfp:latest
./pfp_sif pfp++ --help
Building from source requires:
- Htslib
- OpenMP
git clone https://github.com/marco-oliva/pfp.git
cd pfp
mkdir build && cd build
cmake ..
make
PFP++
Usage: pfp++ [OPTIONS]
Options:
-h,--help Print this help message and exit.
-v,--vcf TEXT ... List of comma ',' separated vcf files. Assuming in genome order!
-r,--ref TEXT ... List of comma ',' separated reference files. Assuming in genome order!
-f,--fasta TEXT:FILE Fasta file to parse.
-i,--int32t TEXT:FILE Integers file to parse.
--int-shift UINT:INT in [0 - 200]
Each integer i in int32t input is interpreted as (i + int-shift).
-H,--haplotype TEXT Haplotype: [1,2,12].
-t,--text TEXT:FILE Text file to parse.
-o,--out-prefix TEXT Output prefix.
-m,--max UINT Max number of samples to analyze.
-S,--samples TEXT File containing the list of samples to parse.
-w,--window-size UINT:INT in [3 - 200]
Sliding window size.
-p,--modulo UINT:INT in [5 - 20000]
Modulo used during parsing.
-j,--threads UINT Number of threads.
--tmp-dir TEXT:DIR Temporary files directory.
-c,--compress-dictionary Also output compressed the dictionary.
--use-vcf-acceleration Use reference parse to avoid re-parsing.
--print-statistics Print out csv containing stats.
--output-occurrences Output count for each dictionary phrase.
--output-sai Output sai array.
--output-last Output last array.
--acgt-only Convert all non ACGT characters from a VCF or FASTA file to N.
--verbose Verbose output.
--version Version number.
--configure Read an ini file.
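A FASTA invocation built from the options above might look like the following sketch. The input name `genome.fa`, the output prefix `genome_pfp`, and the chosen values are assumptions for illustration, not recommended defaults. The `in_range` helper is hypothetical and only checks the ranges stated in the help text for `-w` and `-p`. The script prints the command; remove the `echo` to run it.

```shell
# Hypothetical input and illustrative parameter values.
fasta="genome.fa"
window=10     # -w accepts values in [3 - 200] per the help text
modulo=100    # -p accepts values in [5 - 20000] per the help text
threads=4

# Succeeds when $1 is a non-negative integer within [$2, $3].
in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

if in_range "$window" 3 200 && in_range "$modulo" 5 20000; then
    pfp_args="-f $fasta -w $window -p $modulo -j $threads -o genome_pfp"
    echo "pfp++ $pfp_args"
else
    echo "window size or modulo outside the documented range" >&2
fi
```

Checking the ranges before launching gives an early failure on a mistyped value, although pfp++ performs its own validation as well.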