Skip to content

Snakemake workflow for the processing of EMseq data

License

Notifications You must be signed in to change notification settings

wegnerce/smk_emseq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GitHub tag License Made with Python DOI SnakeMake

smk_emseq - A Snakemake-based workflow for EMseq data processing

📌 Acknowledgement/Disclaimer

This workflow is heavily based on https://github.com/seb-mueller/snakemake-bisulfite by @seb-mueller. I basically just streamlined/tailored the workflow to my needs.

❗ Needed/used software

The workflow is based on the following tools:

The separate installation of the tools is not necessary, they are installed 'on the fly' (see Usage below).

Snakemake should be installed as outlined in its documentation for instance using conda/mamba. It is recommended to create a dedicated conda environment for Snakemake.

📘 Description of the workflow

The workflow can be equally used for the analysis of data derived from NEB's EMseq approach as well as data derived from WGBS (whole-genome bisulfite sequencing). Working with environmental microbes, we made the experience that the usage of commonly available bisulfite treatment kits lead to a strong loss of DNA. As a result, we ended up giving EMseq a try. For details about the methodology have a look at NEBs EMseq paper.

A reference genome stored in resources/ is bisulfite-treated in silico with bismark. Paired-end sequencing data (stored in data/) is subjected to quality-control and adapter-trimming using bbduk. Quality reports are written using fastQC before and after trimming.

Read pairs are subsequently mapped onto the bisulfite-treated genome, alignments with identical mapping positions are removed. Methylations are extracted for all three contexts (CpG, CHH, CHX) and used to generate a .bedGraph and coverage file. The latter can be used for downstream methylation analysis.

The below DAG graph outlines the different processes of the workflow.

DAG of smk_emseq.

🔨 Usage

Start by cloning the repository and move into respective directory.

git clone https://github.com/wegnerce/smk_emseq.git
cd smk_emseq

Place paired sequence data (R{1,2}.fastq.gz) in data/. The repository contains two pairs of exemplary files (La.1_R1.fastq.gz + La.1_R2.fastq.gz & Nd.1_R1.fastq.gz + Nd.1_R2.fastq.gz).

config/ contains, besides from the configuration of the workflow (config/config.yaml), a tab-separated table samples.tsv, which contains a list of all datasets, one per line. The workflow expects *.fastq.gzfiles and R{1,2} as prefixes for forward and reverse read files.

From the root directory of the workflow, processing the data can then be started.

# --use-conda makes sure that needed tools are installed based
# on the requirements specified in the respective *.yaml in /envs
snakemake  --use-conda

The directory structure of the workflow is shown below:

├── config
│   ├── config.yaml
│   └── samples.tsv
├── dag.svg
├── data
│   ├── La.1_R1.fastq.gz
│   ├── La.1_R2.fastq.gz
│   ├── Nd.1_R1.fastq.gz
│   └── Nd.1_R2.fastq.gz
├── logs
├── LICENSE
├── README.md
├── resources
│   ├── adapters.fa
│   └── RHAL1_chromosome_plasmid.fa
├── results
└── workflow
    ├── envs
    │   ├── bbmap.yaml
    │   ├── bismark.yaml
    │   ├── fastqc.yaml
    │   └── samtools.yaml
    ├── rules
    │   ├── bismark.smk
    │   ├── qc.smk
    │   └── sort.smk
    └── Snakefile

Output from the different steps of the workflow are stored in /results and /logs.

The resulting *.cov.gz files (results/04_coverage/*.cov.gz) can be used for downstream methylation analysis.

©️ Carl-Eric Wegner, 2023