Skip to content

Collection of rules performing QC and generating reports.

License

Notifications You must be signed in to change notification settings

hydra-genetics/qc

Repository files navigation

https://hydra-genetics-qc.readthedocs.io

hydra-genetics/qc

Snakemake module containing steps for quality control of sequencing data

compatibility integration test lint snakefmt snakemake dry run

License: GPL-3

💬 Introduction

The module contains rules to run different quality control tools. Most output can be compiled to one .html report using multiqc.

  • Fastqc is checking the raw .fastq files.
  • Mosdepth, samtools stats and picard's collect metrics generates reports on the resulting .bam files.
  • Peddy generates sex checks
  • Rseqc reports on different RNA-specific QC measures

❗ Dependencies

In order to use this module, the following dependencies are required:

hydra-genetics pandas python snakemake singularity drmaa tabulate

Note! Releases of qc <= v0.1.0 needs tabulate<0.9.0 added in requirements.txt

🎒 Preparations

Sample and unit data

Input data should be added to samples.tsv and units.tsv. The following information need to be added to these files:

Column Id Description
samples.tsv
sample unique sample/patient id, one per row
tumor_content ratio of tumor cells to total cells
units.tsv
sample same sample/patient id as in samples.tsv
type data type identifier (one letter), can be one of Tumor, Normal, RNA
platform type of sequencing platform, e.g. NovaSeq
machine specific machine id, e.g. NovaSeq instruments have @Axxxxx
flowcell identifer of flowcell used
lane flowcell lane number
barcode sequence library barcode/index, connect forward and reverse indices by +, e.g. ATGC+ATGC
fastq1/2 absolute path to forward and reverse reads
adapter adapter sequences to be trimmed, separated by comma

Reference data

A reference .fasta-file should be specified in config.yaml in the section reference and fasta. Picard requires also interval files for WGS, wgs_intervals, and for HS metrics, design_intervals. Mosdepth, samtools, and Rseqc require a .bed file with specified regions to be analysed.

MultiQC

Files to be included in the MultiQC report is specified in the config file, see example config. The report can be split into several reports based on for example sample type or rna/dna. A report configuration file for each MultiQC report can also be specified by the config tag in the config file which is forwarded to MultiQC.

✅ Testing

The workflow repository contains a small test dataset .tests/integration which can be run like so:

$ cd .tests/integration
$ snakemake -s ../../Snakefile -j1 --configfile config.yaml --use-singularity

🚀 Usage

To use this module in your workflow, follow the description in the snakemake docs. Add the module to your Snakefile like so:

module qc:
    snakefile:
        github(
            "hydra-genetics/qc",
            path="workflow/Snakefile",
            tag="v0.1.0",
        )
    config:
        config


use rule * from qc as qc_*

Compatibility

Latest:

  • alignment:v0.4.0
  • prealignment:v1.1.0

See COMPATIBLITY.md file for a complete list of module compatibility.

Output files

The following output files should be targeted via another rule:

File Description
qc/multiqc/multiqc.html .html report from multiqc
qc/multiqc/multiqc_data directory holding multiqc data

🧑‍⚖️ Rule Graph

rule_graph