smk_16S_ampseq - A Snakemake-based workflow for processing 16S rRNA gene amplicon sequencing data using DADA2

📌 Acknowledgement/Disclaimer

This workflow is based on the DADA2 workflow for big data and makes heavy use of DADA2 Snakemake wrappers. I debugged the workflow in an HPC environment using SLURM for job submission and noticed problems with accessing the respective wrappers while the workflow was running. This might be related to an issue posted on Stack Overflow. As a consequence, I implemented the Snakemake wrappers as scripts within the workflow.

While writing this workflow, I found this related one written by SilasK very useful - check it out and give credit where credit is due.

❗ Needed/used software

The workflow is based on the following tools:

fastQC
bbduk (part of the BBTools suite)
DADA2

Please cite the respective papers/sources:

Andrews S. 2010. FastQC: a quality control tool for high throughput sequence data. https://www.bioinformatics.babraham.ac.uk/projects/fastqc.

Bushnell B. 2016. BBMap short read aligner. https://www.sourceforge.net/projects/bbmap/.

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods 13:581–583.

A separate installation of the tools is not necessary; they are installed 'on the fly' (see Usage below).

Snakemake should be installed as outlined in its documentation, for instance using conda/mamba. It is recommended to create a dedicated conda environment for Snakemake.

📘 Description of the workflow

Paired-end sequencing data is first subjected to primer trimming and adapter removal using bbduk. Quality reports are written using fastQC before and after preprocessing.

Afterwards, the data sets are processed according to the big data DADA2 workflow, using mostly DADA2 Snakemake wrappers:

Determine quality profiles
Filter and trim
Learn errors
Dereplicate, denoise, make sequence/count table
Chimera removal
Taxonomic assignment

The final sequence/count table after chimera removal (results/06_DADA2_CHIMERACHECK/seqTab.nochim.RDS) and the taxonomic assignments (results/07_DADA2_TAX_ASSIGN/taxa.RDS) can be used for downstream processing/analysis, for instance using phyloseq.

The DAG (see dag.svg in the repository root) outlines the different processes of the workflow.

🔨 Usage

Start by cloning the repository and moving into the respective directory.

git clone https://github.com/wegnerce/smk_16S_ampseq.git
cd smk_16S_ampseq

Place paired sequence data in data/. The workflow expects the following sample nomenclature: NameOfSample_R{1,2}.fastq.gz

The repository contains one exemplary pair of files (K6_rep1_R1.fastq.gz + K6_rep1_R2.fastq.gz).
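
The naming convention can be checked before starting a run. The following is a minimal sketch, not part of the workflow itself; the function names are hypothetical and the regular expression is only one possible reading of the NameOfSample_R{1,2}.fastq.gz pattern.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed interpretation of the nomenclature NameOfSample_R{1,2}.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def group_read_pairs(filenames):
    """Map each sample name to the set of read directions (1, 2) found for it."""
    pairs = defaultdict(set)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match:  # files that do not follow the nomenclature are ignored
            pairs[match.group("sample")].add(int(match.group("read")))
    return dict(pairs)


def incomplete_samples(filenames):
    """Return sample names that lack either the R1 or the R2 file."""
    grouped = group_read_pairs(filenames)
    return sorted(s for s, reads in grouped.items() if reads != {1, 2})


def list_data_dir(data_dir="data"):
    """List file names in data/ (empty list if the directory does not exist)."""
    path = Path(data_dir)
    return sorted(p.name for p in path.iterdir()) if path.is_dir() else []


if __name__ == "__main__":
    names = list_data_dir("data")
    print("samples found:", sorted(group_read_pairs(names)))
    print("samples missing a mate:", incomplete_samples(names))
```

Run from the root directory of the workflow, this should list K6_rep1 as the only sample for the example data and report no missing mates.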

config/ contains, besides the configuration of the workflow (config/config.yaml), a tab-separated table samples.tsv, which lists all datasets, one per line (plus optional metadata; at the moment the workflow only needs the sample names).
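
As an illustration of the table format, here is a minimal sketch for reading sample names from a tab-separated file. The column name "sample" is an assumption made for this example; check the header of config/samples.tsv in the repository for the actual name.

```python
import csv
import io


def read_sample_names(handle, column="sample"):
    """Return the values of one column from a tab-separated table with a header row.

    The default column name "sample" is hypothetical and may differ
    from the header used in config/samples.tsv.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[column] for row in reader if row.get(column)]


if __name__ == "__main__":
    # In-memory stand-in for config/samples.tsv with one metadata column
    example = io.StringIO("sample\tsite\nK6_rep1\tsoil\n")
    print(read_sample_names(example))
```

For the real file, the same function can be called with `open("config/samples.tsv", newline="")` as the handle.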

config/config.yaml should be modified depending on the primer set used for amplicon sequencing. By default, the widely used primer set 341F/785R (Klindworth et al., 2013) is pre-defined in the bbduk settings for primer removal.
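
When switching to a different primer set, it can help to confirm that raw reads actually start with the expected primer. The sketch below expands IUPAC ambiguity codes into a regular expression; it only illustrates the idea and does not reflect how bbduk matches primers internally. The primer sequences are those commonly reported for 341F/785R and are an assumption here; verify them against config/config.yaml.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Assumed sequences for 341F/785R; check config/config.yaml for the values in use
PRIMER_341F = "CCTACGGGNGGCWGCAG"
PRIMER_785R = "GACTACHVGGGTATCTAATCC"


def primer_to_regex(primer):
    """Compile a degenerate primer into a regex matching all of its variants."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def starts_with_primer(read, primer):
    """True if the read begins with any variant of the degenerate primer."""
    return primer_to_regex(primer).match(read.upper()) is not None


if __name__ == "__main__":
    read = "CCTACGGGAGGCAGCAGTGGGGAATATTG"
    print(starts_with_primer(read, PRIMER_341F))
```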

The DADA2 wrappers/scripts can be modified as needed, based on information in the big data DADA2 workflow and the documentation of the DADA2 Snakemake wrappers.

Taxonomic assignments are done using SILVA reference databases maintained by the DADA2 developers.

# move into the resources directory
cd resources
# download a pre-formatted SILVA reference DB
wget https://zenodo.org/records/4587955/files/silva_nr99_v138.1_train_set.fa.gz
cd ..
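
A truncated download is easy to miss, so a quick sanity check of the reference file can be useful. This is a minimal sketch, not part of the workflow; the function name is hypothetical. It reads the gzip-compressed FASTA completely and counts the header lines, which fails with an error if the archive is corrupt.

```python
import gzip
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA header lines (starting with '>') in a gzip-compressed file."""
    count = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


if __name__ == "__main__":
    reference = Path("resources/silva_nr99_v138.1_train_set.fa.gz")
    if reference.exists():
        print(f"{reference}: {count_fasta_records(reference)} reference sequences")
    else:
        print(f"{reference} not found; download it first")
```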

From the root directory of the workflow, processing the data can then be started.

# --use-conda makes sure that needed tools are installed based
# on the requirements specified in the respective *.yaml in workflow/envs
snakemake --use-conda
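
Before a full run, a dry run shows which jobs Snakemake would execute without running them. The sketch below builds such a command from Python; --cores and --dry-run are standard Snakemake options that are not part of the command above, and the helper names are hypothetical.

```python
import subprocess


def build_snakemake_command(cores=1, dry_run=True):
    """Assemble a Snakemake call; --cores and --dry-run are additions to the README command."""
    cmd = ["snakemake", "--use-conda", "--cores", str(cores)]
    if dry_run:
        cmd.append("--dry-run")  # list planned jobs without executing them
    return cmd


def run(cmd):
    """Run the command from the current directory and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Only prints the command; call run(...) from the workflow root to execute it
    print(" ".join(build_snakemake_command(cores=4, dry_run=True)))
```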

In HPC environments using SLURM for job submission, the workflow can be run after setting up a Snakemake SLURM profile. Check out this repository if you are interested.

The directory structure of the workflow is shown below:

├── LICENSE
├── config
│   ├── config.yaml
│   └── samples.tsv
├── data
│   ├── K6_rep1_R1.fastq.gz
│   └── K6_rep1_R2.fastq.gz
├── logs
├── resources
│   ├── adapters.fa
│   └── silva_nr99_v138.1_train_set.fa.gz
├── results
│   ├── 01_TRIMMED
│   ├── 02_DADA2_QUAL_PROFILES
│   ├── 03_DADA2_TRIMMED
│   ├── 04_DADA2_ERROR_MODELS
│   ├── 04_DADA2_UNIQ
│   ├── 05_DADA2_SEQTAB
│   ├── 06_DADA2_CHIMERACHECK
│   └── 07_DADA2_TAX_ASSIGN
└── workflow
    ├── Snakefile
    ├── envs
    ├── rules
    └── scripts

Output from the different steps of the workflow is stored in results/ and logs/.
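
To see how far a run has progressed, the result directories from the tree above can be checked in order. This is a minimal sketch with a hypothetical helper name; the directory names are copied from the directory structure shown above.

```python
from pathlib import Path

# Result directories as listed in the directory structure of the workflow
EXPECTED_RESULT_DIRS = [
    "01_TRIMMED",
    "02_DADA2_QUAL_PROFILES",
    "03_DADA2_TRIMMED",
    "04_DADA2_ERROR_MODELS",
    "04_DADA2_UNIQ",
    "05_DADA2_SEQTAB",
    "06_DADA2_CHIMERACHECK",
    "07_DADA2_TAX_ASSIGN",
]


def missing_result_dirs(root="."):
    """Return the expected result directories that do not exist under root/results."""
    results = Path(root) / "results"
    return [d for d in EXPECTED_RESULT_DIRS if not (results / d).is_dir()]


if __name__ == "__main__":
    missing = missing_result_dirs(".")
    print("missing result directories:", missing if missing else "none")
```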
