This is a modified version of PanGenie that supports spaced seeds.
Spaced seeds (patterns of care and don't-care positions) are more resistant to sequencing errors than conventional contiguous k-mers and may increase the sensitivity of bioinformatics applications. MaskedPanGenie supports them through an optional parameter (-m) that takes a spaced seed, which is then used to detect unique "spaced" k-mers in the provided pangenome graph.
These spaced k-mers will then be compared to spaced k-mers of the reads. PanGenie uses the k-mer counter Jellyfish for this, but unfortunately Jellyfish does not support spaced seeds. Hence, I developed a lightweight tool "MaskJelly" which can be used in combination with Jellyfish to create dictionaries of spaced k-mers before MaskedPanGenie gets executed.
Please note that this preprocessing step requires additional time and resources, which slows down the genotyping pipeline. This problem could be resolved by implementing a counting tool specialised for spaced k-mers, but this is outside of my technical expertise. MaskedPanGenie is intended as a prototype to analyse the effects of spaced seeds and does not aim to be production-level software.
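To make the idea of care and don't-care positions concrete, here is a minimal Python sketch. It is not taken from MaskedPanGenie or MaskJelly: the mask notation (a string of `1` for care and `0` for don't care), the function names and the example sequences are all assumptions made for illustration, and reverse complements are ignored.

```python
from collections import Counter


def apply_mask(kmer: str, mask: str) -> str:
    """Keep only the bases at care positions ('1') of the mask."""
    if len(kmer) != len(mask):
        raise ValueError("k-mer and mask must have the same length")
    return "".join(base for base, m in zip(kmer, mask) if m == "1")


def spaced_kmers(sequence: str, mask: str):
    """Yield the spaced k-mer of every window of len(mask) in the sequence."""
    span = len(mask)
    for start in range(len(sequence) - span + 1):
        yield apply_mask(sequence[start:start + span], mask)


mask = "1101011"              # hypothetical seed: 5 care positions, span 7
reference = "ACGTACG"
read_with_error = "ACTTACG"   # mismatch at index 2, a don't-care position

# The contiguous 7-mers differ, but both collapse to the same spaced k-mer.
print(apply_mask(reference, mask))        # ACTCG
print(apply_mask(read_with_error, mask))  # ACTCG

# Toy count of spaced k-mers over a longer sequence.
counts = Counter(spaced_kmers("ACGTACGTAC", mask))
print(counts)
```

A sequencing error only changes the spaced k-mer if it falls on a care position, which is the intuition behind the error tolerance described above.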
- MaskJelly https://github.com/hhaentze/MaskJelly
- Jellyfish https://github.com/gmarcais/Jellyfish
Follow the installation instructions below, but replace the repository with:
https://github.com/hhaentze/MaskedPangenie
See example
A short-read genotyper for various types of genetic variants (such as SNPs, indels and structural variants) represented in a pangenome graph. Genotypes are computed based on read k-mer counts and a panel of known haplotypes. A description of the method can be found here: https://doi.org/10.1038/s41588-022-01043-w
- conda or Singularity
git clone https://bitbucket.org/jana_ebler/pangenie.git
cd pangenie
conda env create -f environment.yml
conda activate pangenie
mkdir build; cd build; cmake .. ; make
Use the Singularity definition file located in container/ to build an (Ubuntu-based) container as follows (requires root privileges):
[sudo] singularity build pangenie.sif pangenie.def
In all usage examples below, call the PanGenie executable as follows:
singularity exec pangenie.sif PanGenie <PARAMETERS>
For example, to show PanGenie's command line help, use the following command:
singularity exec pangenie.sif PanGenie --help
You can check which versions of PanGenie (git hash) and of the jellyfish library have been installed in the container by running the following commands:
singularity exec pangenie.sif cat /metadata/jellyfish.lib.version
should produce a line like this (so, here, v2.3.0):
$ libjellyfish-2.0-2:amd64 2.3.0-4build1 libjellyfish-2.0-dev:amd64 2.3.0-4build1
singularity exec pangenie.sif cat /metadata/pangenie.git.version
should produce a line like this:
$ 5a1f9c5
PanGenie is a pangenome-based genotyper using short-read data. It computes genotypes for variants represented as bubbles in a pangenome graph by taking information of already known haplotypes (represented as paths through the graph) into account. The required input files are described in detail below.
PanGenie expects a directed and acyclic pangenome graph as input (-v option).
This graph is represented in terms of a VCF file that needs to have certain properties:
- multi-sample - it needs to contain haplotype information of at least one known sample
- fully-phased - haplotype information of the known panel samples is represented by phased genotypes, and each sample must be phased in a single block (i.e. from start to end).
- non-overlapping variants - the VCF represents a pangenome graph. Therefore, overlapping variation must be represented in a single, multi-allelic variant record.
Note especially the third property listed above. See the figure below for an illustration of how overlapping variant alleles need to be represented in the input VCF provided to PanGenie.
We typically generate such VCFs from haplotype-resolved assemblies using this pipeline: https://bitbucket.org/jana_ebler/vcf-merging . However, any VCF with the properties listed above can be used as input.
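As a rough pre-flight check of the three properties listed above, the following Python sketch scans VCF text lines. It is not part of PanGenie; the function name and the checks are assumptions for illustration. It assumes an uncompressed VCF sorted by position, treats any genotype containing `/` as unphased, and defines overlap as overlapping reference intervals (POS to POS + len(REF) - 1). It cannot verify that each sample is phased in a single block.

```python
def check_pangenie_vcf(lines):
    """Return a list of human-readable problems found in VCF text lines."""
    problems = []
    last_end = {}  # chromosome -> largest reference end position seen so far
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Columns 1-9 are fixed; sample columns start at column 10.
            if len(fields) < 10:
                problems.append("no sample columns: haplotype panel is missing")
            continue
        if len(fields) < 10:
            continue  # already reported via the header check
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        format_keys = fields[8].split(":")
        if "GT" in format_keys:
            gt_index = format_keys.index("GT")
            for sample in fields[9:]:
                gt = sample.split(":")[gt_index]
                # Phased genotypes use '|'; '/' marks an unphased genotype.
                if "/" in gt:
                    problems.append(f"{chrom}:{pos} has unphased genotype {gt}")
        else:
            problems.append(f"{chrom}:{pos} has no GT field")
        # A record must start after all previous records on this chromosome end.
        if chrom in last_end and pos <= last_end[chrom]:
            problems.append(f"{chrom}:{pos} overlaps a previous record")
        end = pos + len(ref) - 1
        last_end[chrom] = max(end, last_end.get(chrom, 0))
    return problems


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "chr1\t10\t.\tACG\tA\t.\tPASS\t.\tGT\t0|1",
    "chr1\t11\t.\tC\tT\t.\tPASS\t.\tGT\t0/1",
]
for problem in check_pangenie_vcf(example):
    print(problem)
```

In this toy input the second record is flagged twice: its genotype is unphased, and it starts inside the deletion described by the first record.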
If your input VCF contains overlapping variants, you can run PanGenie using the Snakemake pipeline provided in pipelines/run-from-callset/. This automatically merges overlapping alleles into multi-allelic VCF records, runs PanGenie and later converts the output VCF back to the original representation.
PanGenie is k-mer based and thus expects short reads as input. Reads must be provided in a single FASTA or FASTQ file using the -i option.
PanGenie also needs a reference genome in FASTA format which can be provided using option -r.
PanGenie can be run using the command shown below:
./build/src/PanGenie -i <reads.fa/fq> -r <reference.fa> -v <variants.vcf> -t <nr threads for genotyping> -j <nr threads for k-mer counting>
The result will be a VCF file containing genotypes for the variants provided in the input VCF. By default, the name of the output VCF is result_genotyping.vcf. You can specify the prefix of the output file using option -o <prefix>, i.e. the output file will be named <prefix>_genotyping.vcf.
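When scripting several runs, it can help to assemble the command line and the expected output name in one place. The sketch below is a hypothetical Python helper, not something shipped with PanGenie. It uses only the options described above (-i, -r, -v, -o, -t, -j). The function name, the file names and the thread defaults of the helper are assumptions.

```python
import shlex


def build_pangenie_command(reads, reference, variants, prefix="result",
                           threads=1, kmer_threads=1,
                           executable="./build/src/PanGenie"):
    """Return (argv, expected_output_vcf) for one PanGenie run."""
    argv = [
        executable,
        "-i", reads,              # reads in FASTA/FASTQ format
        "-r", reference,          # reference genome in FASTA format
        "-v", variants,           # pangenome graph as VCF
        "-o", prefix,             # output prefix
        "-t", str(threads),       # threads for genotyping
        "-j", str(kmer_threads),  # threads for k-mer counting
    ]
    # The output VCF is named <prefix>_genotyping.vcf.
    return argv, f"{prefix}_genotyping.vcf"


argv, output_vcf = build_pangenie_command(
    "reads.fq", "reference.fa", "variants.vcf",
    prefix="sample1", threads=8, kmer_threads=8,
)
print(shlex.join(argv))
print(output_vcf)  # sample1_genotyping.vcf
# To actually run it: subprocess.run(argv, check=True)
```

Returning the argument list rather than a shell string avoids quoting problems with unusual file names; `shlex.join` is only used here for display.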
The full list of options is provided below.
program: PanGenie - genotyping and phasing based on kmer-counting and known haplotype sequences.
author: Jana Ebler
usage: PanGenie [options] -i <reads.fa/fq> -r <reference.fa> -v <variants.vcf>
options:
-c count all read kmers instead of only those located in graph.
-d do not add reference as additional path.
-e VAL size of hash used by jellyfish. (default: 3000000000).
-g run genotyping (Forward backward algorithm, default behaviour).
-i VAL sequencing reads in FASTA/FASTQ format or Jellyfish databa