Snakemake workflow to analyze somatic copy number alterations from whole exome data using Sequenza and CNVkit
samples.csv should be a CSV file with the following columns:
- patient - patient identifier
- sample_type - either normal or tumor
- bam - filepath for BAM for that sample
All patients are required to have both normal and tumor BAMs.
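A minimal sketch of a pre-flight check for the file format described above. It is not part of the workflow; the function name check_samples, the patient IDs, and the BAM paths are hypothetical and used only for illustration. It verifies the three required columns and that every patient has both a normal and a tumor row.

```python
import csv
from collections import defaultdict

REQUIRED_COLUMNS = {"patient", "sample_type", "bam"}
REQUIRED_TYPES = {"normal", "tumor"}


def check_samples(lines):
    """Return a list of problems found in samples.csv content (empty if none)."""
    reader = csv.DictReader(lines)
    missing_cols = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing_cols:
        return [f"missing columns: {sorted(missing_cols)}"]

    problems = []
    types_by_patient = defaultdict(set)
    for row in reader:
        sample_type = row["sample_type"].strip()
        if sample_type not in REQUIRED_TYPES:
            problems.append(
                f"{row['patient']}: unexpected sample_type {sample_type!r}"
            )
        types_by_patient[row["patient"]].add(sample_type)

    # Every patient needs both a normal and a tumor BAM.
    for patient, types in sorted(types_by_patient.items()):
        for absent in sorted(REQUIRED_TYPES - types):
            problems.append(f"{patient}: no {absent} BAM listed")
    return problems


# Hypothetical example content; patient IDs and paths are placeholders.
example = """patient,sample_type,bam
P1,normal,/data/P1_normal.bam
P1,tumor,/data/P1_tumor.bam
P2,tumor,/data/P2_tumor.bam
""".splitlines()

print(check_samples(example))
```

To check a real file, pass an open file handle instead of the example lines, for instance `check_samples(open("samples.csv", newline=""))`.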
- Install Snakemake
- Install Singularity or all dependencies (see Dependencies section)
- Clone repository
- Create samples.csv file with sample information (see Samples File Format section)
- Modify config.yaml
- If using a SLURM cluster for execution, edit cluster.json and run_pipeline.sh
Singularity is recommended to ensure reproducibility and avoid having to manually manage installation of Sequenza, CNVkit, and R. If you would prefer to not use Singularity, make sure the following programs are installed and accessible from your command line:
- sequenza-utils command line tool
- CNVkit
- R
The following R libraries are also needed:
- sequenza
- readr
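If you are not using Singularity, a quick way to confirm the command line tools are reachable is to look them up on your PATH. The sketch below is an illustration, not part of the workflow. The executable names are assumptions (sequenza-utils, cnvkit.py for CNVkit, and R) and may differ on your system, and missing_tools is a hypothetical helper. It does not check that the sequenza and readr R libraries are installed.

```python
import shutil

# Assumed executable names; adjust if your installation differs.
DEFAULT_TOOLS = ("sequenza-utils", "cnvkit.py", "R")


def missing_tools(tools=DEFAULT_TOOLS):
    """Return the names from `tools` that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    absent = missing_tools()
    if absent:
        print("Not on PATH: " + ", ".join(absent))
    else:
        print("All tools found")
```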
NOTE: This workflow saves all intermediate and output files to the same directory where Snakefile is located. Ensure you have sufficient storage before running.
If running locally:
snakemake -j [# of cores] [--use-singularity]
If running on a SLURM cluster (after editing cluster.json and run_pipeline.sh):
# This needs to be run from the directory where Snakefile is located
sbatch run_pipeline.sh
- results/sequenza_info.csv - Sequenza purity and ploidy estimates for each tumor
- cnvkit-results - CNVkit segmentation files and call files
- sequenza/[patient] - results/ - Sequenza segmentation files for that [patient]
- access.5kb.hg38.bed - accessible regions in the reference fasta at 5kb windows
sequenza_info.csv and cnvkit-results/ will be the most interesting to users.
I primarily use Sequenza for purity/ploidy estimation and CNVkit for copy number segmentation.
See this paper for a discussion of copy number caller performance