Skip to content

Snakemake workflow for the processing of 16S rRNA gene-derived amplicon sequencing data

License

Notifications You must be signed in to change notification settings

wegnerce/smk_16S_ampseq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GitHub tag License Made with Python DOI SnakeMake

smk_16S_ampseq - A Snakemake-based workflow for processing 16S rRNA gene amplicon sequencing data using DADA2

📌 Acknowledgement/Disclaimer

This workflow is based on the DADA2 workflow for big data, and heavily uses DADA2 Snakemake wrappers. I debugged the workflow in a HPC environment using SLURM for job submission, and noticed problems with accessing the respective wrappers while the workflow is running. This might be related to an issue posted on stackoverflow. As a consequence, I implemented the Snakemake wrappers as scripts into the workflow.

While writing this workflow, I found this related one written by SilasK very useful - check it out and give credit where credit is due.

❗ Needed/used software

The workflow is based on the following tools:

Please cite the respective papers/sources:

The separate installation of the tools is not necessary, they are installed 'on the fly' (see Usage below).

Snakemake should be installed as outlined in its documentation for instance using conda/mamba. It is recommended to create a dedicated conda environment for Snakemake.

📘 Description of the workflow

Paired-end sequencing data is first subjected to primer trimming and adapter removal using bbduk. Quality reports are written using fastQCbefore and after preprocessing.

In the following, data sets are treated according to the big data DADA2 workflow using mostly DADA2 Snakemake wrappers:

  1. Determine quality profiles
  2. Filter and trim
  3. Learn errors
  4. Dereplicate, denoise, make sequence/count table
  5. Chimera removal
  6. Taxonomic assignment

The final sequence/count table (after chimera removal) (results/06_DADA2_CHIMERACHECK/ seqTab.nochim.RDS) and the taxonomic assignments (ls 07_DADA2_TAX_ASSIGN/ taxa.RDS) can be used for downstream processing/analysis, for instance using phyloseq

The below DAG graph outlines the different processes of the workflow.

DAG of smk_16S_ampseq.

🔨 Usage

Start by cloning the repository and move into respective directory.

git clone https://github.com/wegnerce/smk_16S_ampseq.git
cd smk_16S_ampseq

Place paired sequence data in data/. The workflow expects the following sample nomenclature: »» NameOfSample_R{1,2}.fastq.gz

The repository contains one exemplary pair of files (K6_rep1_R1.fastq.gz + K6_rep1_R2.fastq.gz).

config/ contains, besides from the configuration of the workflow (config/config.yaml), a tab-separated table samples.tsv, which contains a list of all datasets, one per line (and potential metadata, for the workflow only sample names are needed right now).

config/config.yaml should be modified dependent on the used primer set for amplicon sequencing. By default, the widely used primer set 341F/785R (Klindworth et al., 2013) is pre-defined in the bbduksettings for primer removal.

The DADA2 wrappers/scripts can be modified as needed, based on information in the big data DADA2 workflow and the documentation of the DADA2 Snakemake wrappers.

Taxonomic assignments are done using SILVA reference databases, maintained by the DADA2developers.

# move into the resources directory
cd resources
# download a pre-formatted SILVA reference DB
wget https://zenodo.org/records/4587955/files/silva_nr99_v138.1_train_set.fa.gz
cd ..

From the root directory of the workflow, processing the data can then be started.

# --use-conda makes sure that needed tools are installed based
# on the requirements specified in the respective *.yaml in /envs
snakemake  --use-conda

In HPC environents using SLURM for job submission, the workflow can be run after setting up a Snakemake SLURM profile, check out this repository if you are interested.

The directory structure of the workflow is shown below:

├── LICENSE
├── config
│   ├── config.yaml
│   └── samples.tsv
├── data
│   ├── K6_rep1_R1.fastq.gz
│   └── K6_rep1_R2.fastq.gz
├── logs
├── resources
│   ├── adapters.fa
│   └── silva_nr99_v138.1_train_set.fa.gz
├── results
│   ├── 01_TRIMMED
│   ├── 02_DADA2_QUAL_PROFILES
│   ├── 03_DADA2_TRIMMED
│   ├── 04_DADA2_ERROR_MODELS
│   ├── 04_DADA2_UNIQ
│   ├── 05_DADA2_SEQTAB
│   ├── 06_DADA2_CHIMERACHECK
│   └── 07_DADA2_TAX_ASSIGN
└── workflow
    ├── Snakefile
    ├── envs
    ├── rules
    └── scripts

Output from the different steps of the workflow are stored in /results and /logs.

©️ Carl-Eric Wegner, 2023

About

Snakemake workflow for the processing of 16S rRNA gene-derived amplicon sequencing data

Resources

License

Stars

Watchers

Forks

Packages

No packages published