An automated bacterial whole genome assembly and typing pipeline which primarily uses Illumina paired-end whole genome sequencing (WGS) data. In addition, Jekesa performs extensive analyses for Escherichia coli, Salmonella, Streptococcus pneumoniae and Streptococcus pyogenes (Group A Streptococcus), including in-depth virulence predictions for various other pathogens (refer to sections below). Furthermore, Jekesa also performs whole-genome reference-free alignments, pairwise SNP-site analysis and clustering, and generates a neighbor-joining tree which can be easily visualized using e.g. Microreact.
Jekesa (Illuminate) currently runs on a server (single compute node). The pipeline is written in Bash, R, and R Markdown, and generates the results report as an Excel workbook (.xlsx format) and in HTML format.
- QC and read filtering using FastQC and trim_galore.
- Species identification and closest reference detection using Bactinspector.
- Check for contamination using ConFindr, kraken2 and MiniKraken2_v2_8GB.
- De novo assembly using either SKESA, SPAdes, MEGAHIT, or velvet as implemented in Shovill.
- Generation of assembly metrics using QUAST.
- Multi-locus sequence typing based on assembled contigs using mlst and PubMLST database.
- Detection of acquired AMR genes and chromosomal mutations and their associated resistance phenotypes performed using resfinder, AMRFinderPlus and pointfinder.
- Optionally, known and novel variants in anti-microbial resistance genes, predicted from clean reads using ariba and either CARD (The Comprehensive Antibiotic Resistance Database) or resfinder database.
- Virulence genes detected using AMRFinderPlus.
- In addition, in-depth virulence gene detection for specific pathogens such as E. coli, E. faecalis, E. faecium, S. aureus and L. monocytogenes is performed using VirulenceFinder.
- Optionally, detection of variants (known/novel) in virulence factor genes, from cleaned reads, using ariba and the VFDB. ARIBA can be activated by uncommenting the ARIBA-specific scripts in the main JEKESA script.
- Coming soon
- Serotyping using SerotypeFinder.
- Serotyping using seroba.
- Pili detection based on reference sequences used in Nakano et al., 2018.
- PBP gene typing and MIC profiling using CDC Streptococcus Lab SPN scripts and sequence databases.
- Calculate core and accessory distances and cluster genomes (assigning global pneumococcal sequence clusters; GPSCs) using PopPUNK, as well as assign new genomes to clusters.
- EMM typing and MIC profiling using CDC Streptococcus Lab GAS scripts and sequence databases.
- Calculate core and accessory distances and cluster/define genomes/strains using PopPUNK, as well as assign new genomes to clusters.
- Reference-free alignments performed using SKA. In addition, SKA distance is used to calculate pairwise SNP differences between samples and assign SNP-based clusters.
- The generated variant alignments are used to generate a neighbor-joining tree using rapidNJ with 1000 bootstrap replicates.
All results will be stored in Results-ProjectName
including:
- The final report named
ProjectName-WGS-typing-report.xlsx
- Results from each step of the analysis in .xlsx format
- Neighbor-joining tree file (and associated files) generated using PopPUNK.
- Subfolders containing:
- Detailed HTML report generated using
rmarkdown
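The naming convention above can be turned into a path for downstream scripts. This is a minimal sketch only: the function name `jekesa_report_path` is hypothetical, and it assumes the results folder sits directly under a base directory that you pass in, which the text above does not state.

```shell
# Hypothetical helper, not part of Jekesa.
# Prints the expected location of the final report for a project,
# assuming Results-<ProjectName> lives directly under <base_dir>.
jekesa_report_path() {
  local base_dir="$1" project="$2"
  printf '%s/Results-%s/%s-WGS-typing-report.xlsx\n' \
    "$base_dir" "$project" "$project"
}
```

For example, `jekesa_report_path /data demo` prints the path that a report for a project named `demo` would have under `/data`, given the assumption above.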
usage: jekesa <options>
OPTIONS:
-p Path to output directory or project name
-a Select the assembler to use. Options available: 'spades', 'skesa', 'velvet', 'megahit'
-s Species scheme name to use for mlst typing.
Use: 'spneumoniae' or 'spyogenes' or 'senterica', for Streptococcus pneumoniae or Streptococcus pyogenes or Salmonella
detailed analysis. Otherwise for any other scheme use: 'other'. To check other available scheme names use: mlst --longList.
-t Number of threads to use <integer>, (minimum value should be: 6)
-g Only perform de novo assembly
-c Path to assembled contigs to include in the typing analysis (only mlst and resistance profiling).
-h Show this help
-v Show version
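The constraints stated in the usage text (a fixed set of assemblers and a thread count of at least 6) can be checked before starting a long run. The following is a minimal sketch and not part of Jekesa; the function name `check_jekesa_args` is hypothetical.

```shell
# Hypothetical pre-flight check, not part of Jekesa itself.
# Usage: check_jekesa_args <assembler> <threads>
check_jekesa_args() {
  local assembler="$1" threads="$2"

  # Assemblers listed under the -a option in the usage text.
  case "$assembler" in
    spades|skesa|velvet|megahit) ;;
    *) echo "unsupported assembler: $assembler" >&2; return 1 ;;
  esac

  # -t expects an integer: reject empty or non-digit values.
  case "$threads" in
    ''|*[!0-9]*) echo "threads must be an integer: $threads" >&2; return 1 ;;
  esac

  # The usage text gives 6 as the minimum thread count.
  if [ "$threads" -lt 6 ]; then
    echo "threads should be at least 6, got: $threads" >&2
    return 1
  fi
  return 0
}
```

A call such as `check_jekesa_args skesa 16 && jekesa -p path/to/analysis/directory -a skesa -s spyogenes -t 16` would then only launch the pipeline when both values pass.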
cd jekesa
# This script creates the analysis directory and soft-links the FASTQ files
bin/find-link-fastq.sh path/to/analysis/directory path/to/sampleID/list path/to/raw/fastqfiles
# Now run the jekesa pipeline
conda activate jekesa
jekesa -p path/to/analysis/directory -a skesa -s spyogenes -t 16 &
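The example above sends the run to the background with `&`. On a shared server it can also help to keep the run alive after logout and to capture its output in a log file. The sketch below shows one way to do this with the standard `nohup` utility; the helper name `run_detached` and the log file name are hypothetical and not part of Jekesa.

```shell
# Hypothetical helper, not part of Jekesa.
# Runs a command in the background, immune to hangups, with stdout and
# stderr written to a log file. Prints the PID of the background job.
# Usage: run_detached <logfile> <command> [args...]
run_detached() {
  local log="$1"
  shift
  nohup "$@" > "$log" 2>&1 &
  echo "$!"
}

# Example, using the same options as the run above:
# pid=$(run_detached jekesa.log jekesa -p path/to/analysis/directory -a skesa -s spyogenes -t 16)
# tail -f jekesa.log
```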
Clone the git repository:
git clone https://github.com/stanikae/jekesa.git
cd jekesa
After cloning the jekesa git repo, do the following to install the required dependencies and to set up the conda environments:
# JEKESA
conda env create -n jekesa --file ./lib/jekesa_v1.0.yml
conda env create -n r_env --file ./lib/jekesa-v1.0_r_env.yml
# CGE tools
## ResFinder4
conda env create -n resfinder --file ./lib/jekesa-v1.0_resfinder4.yml
## Other CGE tools
conda env create -n cge --file ./lib/jekesa-v1.0_cge.yml
conda env create -n srst2 --file ./lib/jekesa-v1.0_srst2.yml
conda activate srst2
pip install spn_scripts/srst2_env/
conda deactivate
## Activate jekesa
conda activate jekesa
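After running the commands above, it can be useful to confirm that all five environments (jekesa, r_env, resfinder, cge and srst2) were created. The sketch below is not part of Jekesa; the function name `missing_jekesa_envs` is hypothetical. It reads the output of `conda env list` on standard input and prints the name of each expected environment that is absent.

```shell
# Hypothetical check, not part of Jekesa.
# Reads `conda env list` output on stdin and prints, one per line,
# the expected environments that are not present.
missing_jekesa_envs() {
  local listing env
  listing=$(cat)
  for env in jekesa r_env resfinder cge srst2; do
    # Environment names appear in the first column of the listing.
    if ! printf '%s\n' "$listing" |
        awk -v e="$env" '$1 == e { found = 1 } END { exit !found }'; then
      echo "$env"
    fi
  done
}

# Usage:
# conda env list | missing_jekesa_envs
```

An empty result means every expected environment name was found in the listing.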
cd jekesa
git pull
conda env update -n jekesa --file ./lib/jekesa_v1.0.yml --prune
To download and set up the required databases, execute the 00.download_databases.sh script:
cd jekesa
conda activate jekesa
bash bin/00.download_databases.sh /path/to/installation/directory
To set up the ConFindr databases, follow the instructions at https://olc-bioinformatics.github.io/ConFindr/install/, as this requires registration on PubMLST.
conda deactivate
Stanford Kwenda
Kwenda S., Allam M., Khumalo Z.T.H., Mtshali S., Mnyameni F., Ismail A. Jekesa: an automated easy-to-use pipeline for bacterial whole genome typing. GitHub. https://github.com/stanikae/jekesa