This repository is dedicated to the experiment section of the paper: Parmigiani, L., Wittler, R., Stoye, J.: Revisiting pangenome openness with k-mers. PCJ Math & Comp Biol. (2024).
- The analysis was performed on twelve bacterial pangenomes
- Tools: Pangrowth, Roary, Pantools, BPGA.
- Datasets: 12 bacterial species, from NCBI RefSeq, with the filter “Assembly level: Complete genome”.
- Annotations: Prokka (1.14.6, standard parameters).
Note: some of these steps can take several hours (e.g., gene homology clustering, annotation), so we have uploaded the final results of all the steps, together with the scripts to work with them, in the folder analysis.
You can use these results directly with the R scripts analysis/reading.R and analysis/fitting.R to read and fit the data.
To generate the results for each tool, we provide scripts and a snakemake workflow. The pipeline is divided into steps, so you can reproduce part of the results without running everything.
Each tool can be run using snakemake, except for BPGA, for which a script is provided.
The folder config contains snakemake config files that specify on which bacteria the pipeline should be run and for which value of k. Three options are provided (config_all, config_openclosed, config_quicktest), but any combination of species can be created: it is enough to provide the kingdom (in this case Bacteria for all), genus (e.g., Francisella) and species (e.g., tularensis).
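The snakemake invocations in this README differ only in the target and the config file. As a minimal sketch, assuming the config files carry a .yaml extension as in the commands shown in this README, a small helper can build the command line for any target and config. The function name snakemake_cmd is hypothetical and not part of the repository; it only prints the command so you can inspect it before running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): print the snakemake
# command for one target and one config name, using the flags from this README.
snakemake_cmd() {
    target=$1          # one of: kmer_run, roary_run, pantools_run
    config_name=$2     # e.g. config_quicktest, config_openclosed, config_all
    cores=${3:-12}     # defaults to the 12 cores used in this README
    printf 'snakemake --cores %s --latency-wait 60 %s -p --verbose --rerun-incomplete --configfile config/%s.yaml\n' \
        "$cores" "$target" "$config_name"
}

# Print the three commands for the quick test (nothing is executed here).
for target in kmer_run roary_run pantools_run; do
    snakemake_cmd "$target" config_quicktest
done
```

To execute a printed command, pipe it to a shell, for example: snakemake_cmd kmer_run config_quicktest 4 | sh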
Genomes were downloaded from NCBI using ncbi-genome-download.
pip install ncbi-genome-download
If the NCBI API has changed or the tool no longer works, the folder data/accession provides the accession numbers of every genome used.
These are also used to download the exact dataset we used.
In the folder data we provide a script (data/download.sh) to download all the genomes from NCBI. Note that some steps require the genomes to be uncompressed. The script automatically uncompresses the genomes and leaves them uncompressed for the whole time. The whole uncompressed dataset is around 5GB.
Each species will be downloaded into the folder data.
Usage:
./data/download.sh all
If you want to download the dataset used for most of the images, you can run:
./data/download.sh openclosed
which downloads only the genomes for Streptococcus pneumoniae and Yersinia pestis.
For a quicker test there is also:
./data/download.sh quicktest
which downloads only 57 genomes of Francisella tularensis.
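If you prefer to fetch genomes for a single accession list by hand, the following is a minimal sketch of how an ncbi-genome-download command could be assembled. It is not the logic of data/download.sh, which is not reproduced here. The function name build_download_cmd and the accession file name in the example are hypothetical; the file is assumed to contain one assembly accession per line.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): print, but do not run,
# an ncbi-genome-download command for one accession list.
build_download_cmd() {
    accession_file=$1   # assumed format: one assembly accession per line
    out_dir=$2          # folder where the genomes should be written
    printf 'ncbi-genome-download --section refseq --formats fasta --assembly-accessions %s --output-folder %s bacteria\n' \
        "$accession_file" "$out_dir"
}

# Example with a hypothetical file name; replace it with a file from data/accession.
build_download_cmd data/accession/Francisella_tularensis.txt data
```

Printing the command first makes it easy to check the accession file and output folder before any download starts.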
Install pangrowth
cd scripts
git clone https://gitlab.ub.uni-bielefeld.de/gi/pangrowth
cd pangrowth
make
cd ../..
The following commands produce the histogram and the pangenome growth for k-mers for each species in the config file. The output can be found in results/data/kmer_k/species (e.g., results/data/kmer_17/Francisella_tularensis).
snakemake --cores 12 --latency-wait 60 kmer_run -p --verbose --rerun-incomplete --configfile config/config_quicktest.yaml
snakemake --cores 12 --latency-wait 60 kmer_run -p --verbose --rerun-incomplete --configfile config/config_openclosed.yaml
snakemake --cores 12 --latency-wait 60 kmer_run -p --verbose --rerun-incomplete --configfile config/config_all.yaml
The genes pipeline requires annotation with Prokka. This step can be run manually, or it will be called automatically when running either roary_run or pantools_run. The annotations are also required if you want to run BPGA.
Ubuntu
sudo apt-get install roary
The following commands produce the histogram and the pangenome growth for genes for each species in the config file. The output can be found in results/data/gene_roary/species (e.g., results/data/gene_roary/Francisella_tularensis).
snakemake --cores 12 --latency-wait 60 roary_run -p --verbose --rerun-incomplete --configfile config/config_quicktest.yaml
snakemake --cores 12 --latency-wait 60 roary_run -p --verbose --rerun-incomplete --configfile config/config_openclosed.yaml
snakemake --cores 12 --latency-wait 60 roary_run -p --verbose --rerun-incomplete --configfile config/config_all.yaml
- Pantools
- Refer to the manual for installation instructions.
- Pantools must be installed in the directory scripts/pantools
Ubuntu
mkdir -p scripts/pantools
cd scripts/pantools
wget https://www.bioinformatics.nl/pangenomics/data/pantools-4.2.2.jar
The following commands produce the histogram and the pangenome growth for genes for each species in the config file. The output can be found in results/data/gene_pantools/species (e.g., results/data/gene_pantools/Francisella_tularensis).
snakemake --cores 12 --latency-wait 60 pantools_run -p --verbose --rerun-incomplete --configfile config/config_quicktest.yaml
snakemake --cores 12 --latency-wait 60 pantools_run -p --verbose --rerun-incomplete --configfile config/config_openclosed.yaml
snakemake --cores 12 --latency-wait 60 pantools_run -p --verbose --rerun-incomplete --configfile config/config_all.yaml
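After a run, it can be handy to check that the result folders described in this README exist. The sketch below maps each method to its output folder pattern (results/data/kmer_k, results/data/gene_roary, results/data/gene_pantools). The function name result_dir is hypothetical, and the default k=17 is only taken from the example path in this README; use the value of k from your config file.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): map a method name to the
# result folder pattern described in this README.
result_dir() {
    method=$1; genus=$2; species=$3
    k=${4:-17}   # assumption: 17 is only the k from the README example path
    case "$method" in
        kmer)     sub="kmer_$k" ;;
        roary)    sub="gene_roary" ;;
        pantools) sub="gene_pantools" ;;
        *) echo "unknown method: $method" >&2; return 1 ;;
    esac
    printf 'results/data/%s/%s_%s\n' "$sub" "$genus" "$species"
}

# Report which result folders exist for the quick test species.
for method in kmer roary pantools; do
    dir=$(result_dir "$method" Francisella tularensis)
    if [ -d "$dir" ]; then echo "found   $dir"; else echo "missing $dir"; fi
done
```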
BPGA can be installed from https://sourceforge.net/projects/bpgatool/files/.
You can find a script at scripts/bpga_automatic.sh that accepts a species as input and starts a tmux session, automatically sending the keystrokes needed to run BPGA.
Please ensure that the faa variable within scripts/bpga_automatic.sh is set to the directory containing all the relevant protein files.
The script has to be copied into the directory containing BPGA as follows:
wget https://downloads.sourceforge.net/project/bpgatool/BPGA-1.3-linux-x86_64-0-0-0.tar.gz
tar -xf BPGA-1.3-linux-x86_64-0-0-0.tar.gz
cp ./scripts/bpga_automatic.sh BPGA-1.3-linux-x86_64-0-0-0/BPGA-Version-1.3/bin
cd BPGA-1.3-linux-x86_64-0-0-0/BPGA-Version-1.3/bin
chmod +x BPGA-Version-1.3
If you encounter issues with the automated script, BPGA can also be executed manually. Simply pass in the directory containing the protein files.
The folder containing all the proteins can be obtained by running Prokka.
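Prokka writes the protein sequences of each annotated genome to a .faa file. As a minimal sketch, assuming the Prokka outputs sit in one directory tree and that the .faa file names are unique across genomes, the protein files can be gathered into a single flat directory for BPGA. The function name collect_faa and both paths in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): copy every Prokka protein
# file (*.faa) found under a directory tree into one flat directory.
# Keep the target directory outside the Prokka tree, and note that files
# with identical names would overwrite each other.
collect_faa() {
    prokka_root=$1   # directory tree holding the Prokka outputs
    faa_dir=$2       # flat directory to hand to BPGA (or the faa variable)
    mkdir -p "$faa_dir"
    find "$prokka_root" -type f -name '*.faa' -exec cp {} "$faa_dir"/ \;
    # Print the number of protein files now present in the target directory
    find "$faa_dir" -type f -name '*.faa' | wc -l | tr -d ' '
}

# Usage (paths are placeholders):
# collect_faa path/to/prokka_output path/to/bpga_proteins
```

The printed count should match the number of genomes of the species you intend to analyse.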
The folder analysis contains all the scripts used to compare the results of the three methods, as well as the resulting data.
Since computing all the previous steps can be time-consuming, we provide intermediate results in the folder analysis/data.
- The file analysis/pankmer_reading.R provides functions to read the results from Pangrowth, Roary, Pantools, BPGA and Prokka.