Sample code for NGS processing

This programme takes advantage of Stacks to analyze NGS data. We have successfully processed NGS data for Brassicaceae plants and Triticum aestivum L.

Required environment

Shell

  • stacks
  • vcftools
  • samtools
  • trimmomatic
  • java
  • bwa

R

  • CMplot

Python

  • matplotlib
  • pandas
  • scipy
  • numpy
  • etc.

→ Please refer to requirements.txt for more information.

Installing Prerequisites

For shell package(s)

Please refer to the homepage of each package for more information.
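If the tools are not yet installed, one common option (our suggestion, not a requirement of this programme) is the Bioconda channel:

# One possible way to install the shell tools, via conda/Bioconda
conda install -c bioconda stacks vcftools samtools trimmomatic bwa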

For R package(s)

Run install.packages("package_needed") in R.
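The same can be done non-interactively from the shell (the CRAN mirror URL here is just an example):

# Install CMplot for R without opening an interactive R session
Rscript -e 'install.packages("CMplot", repos = "https://cloud.r-project.org")'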

For Python package(s)

Run the following script in a shell. It downloads requirements.txt and installs the listed Python packages automatically.

# Download requirements.txt with wget or curl, whichever is available
if command -v wget >/dev/null 2>&1; then
  wget -O ./requirements.txt https://raw.githubusercontent.com/ShuXingYu94/NGS_Process_Sample/master/requirements.txt
elif command -v curl >/dev/null 2>&1; then
  curl -fL -o ./requirements.txt https://raw.githubusercontent.com/ShuXingYu94/NGS_Process_Sample/master/requirements.txt
else
  echo "Neither wget nor curl is installed. Please install one of them."
  exit 1
fi

# Install the listed Python packages with pip3 (or pip as a fallback)
if command -v pip3 >/dev/null 2>&1; then
  pip3 install -r requirements.txt
elif command -v pip >/dev/null 2>&1; then
  pip install -r requirements.txt
else
  echo "pip/pip3 not found. Please install the Python packages with another tool such as conda."
  exit 1
fi

# Clean up the downloaded file
rm ./requirements.txt

Quick Guide

1. Working directory

Please start with the following folder structure.

work_dir
├── BaseCall
│   ├── sample1_R1_001.fastq.gz
│   ├── sample1_R2_001.fastq.gz
│   ├── sample2_R1_001.fastq.gz
│   └── sample2_R2_001.fastq.gz
├── MIGadapter.fasta → (Optional)
└── popmap.txt → (Optional)

Normally, putting NGS data files in ./work_dir/BaseCall is all you need to do.

  • For trimming of the original data files, an adapter.fasta file is needed. By default, a MIGadapter.fasta file will be downloaded.

  • In case of multiple-population analysis, you can put a popmap.txt file in ./work_dir/ (see the example below). By default, a popmap.txt file will be automatically generated with all samples assigned to the same population.
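Stacks expects popmap.txt to be tab-separated, one sample per line followed by its population name. A minimal sketch for two samples in two populations (the population names popA and popB are placeholders):

# Create a tab-separated popmap.txt assigning the two samples to different populations
printf 'sample1\tpopA\nsample2\tpopB\n' > ./work_dir/popmap.txt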

2. Download shell scripts and configuration

Download scripts

Download the main.sh and statistics.sh files from the master branch to your working directory.

You can find the recommended version in the latest release.

Required configurations

Input the required configurations in main.sh as follows.

# Required Settings

workdir=/Users/.../work_dir
mapping_db=/Users/.../reference_genome.fa
trimmomatic_dir=/Users/.../Trimmomatic-0.39/trimmomatic-0.39.jar
adapter=${workdir}/MIGadapter.fasta
popmap_dir=${workdir}/popmap.txt

# Read file names
files="sample1 sample2"

Optional configurations

Basically, there is no need to change the optional configurations.

However, when analyzing more than two samples, it is recommended to set stacks_r=0.8 rather than the default stacks_r=1.0.
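For illustration, the corresponding line in main.sh would then read as follows (a sketch; only the stacks_r name is taken from above, and the comment reflects the meaning of the -r option of the Stacks populations program):

# Optional: minimum proportion of samples required to process a locus (passed to populations -r)
stacks_r=0.8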

→ For more information, please refer to the documentation of Trimmomatic and Stacks.

3. Executing script

Note

Currently, copying all the commands and executing them manually is recommended.

Please execute the following commands in the same directory as main.sh and statistics.sh,

bash main.sh
bash statistics.sh

or simply copy all the commands and execute them.

Ex. To rerun the programme

Warning

This is not a part of the analysis!

Check the directory and files you are deleting!

  1. To start over from the beginning and delete all the generated files, execute the following commands:
rm -rf ${workdir}/aligned
rm -rf ${workdir}/log
rm -rf ${workdir}/stacks
rm -rf ${workdir}/trimmed
rm -rf ${workdir}/statistics
rm ${workdir}/*.py
rm ${workdir}/*.r

# Optional: rm ${workdir}/popmap.txt

This will remove all the generated and downloaded files and restore the original folder structure.

  2. To rerun only the statistical analysis, execute the following command:
rm -rf ${workdir}/statistics

Multithread Run

From releases later than v1.0.0, you can execute the script multithreading.sh by itself to carry out the analysis more than 20 times faster than before (as of 2023/1/3).

It takes advantage of multicore processing to accelerate the whole calculation, which requires more memory than before.

Please note that it is not recommended to run this script until you have successfully executed all the commands in the Quick Guide at least once.
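Once that has been done, running it looks the same as the other scripts (assuming multithreading.sh reads its configuration the same way main.sh does):

bash multithreading.sh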

Documentation

Documentation is still in progress; please follow the steps in Quick Guide.

Interpreting Results

1. Introduction to the folders

After execution of main.sh and statistics.sh, the folder should look something like this:


work_dir
├── BaseCall
├── aligned
├── log
├── stacks
├── statistics
│   ├── Consensus
│   ├── Coverage
│   └── Read_depth
└── trimmed

# files inside these folders are not shown

Here, the original fastq.gz data files are in the folder ./BaseCall. After trimming with the given adapter.fasta file, the reads are saved in the folder ./trimmed. The trimmed files are subsequently mapped to the reference genome and saved in .bam format in the folder ./aligned. Then, Stacks is used to analyze the mapped reads and carry out SNP calling and population analysis. These results are saved in ./stacks.
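As a rough illustration of that flow (a sketch only, not the exact commands in main.sh; the Trimmomatic clip settings, output file names, and the use of ref_map.pl are assumptions), the per-sample steps look roughly like this:

# Trim adapter sequences from the paired-end reads with Trimmomatic
java -jar ${trimmomatic_dir} PE \
  BaseCall/sample1_R1_001.fastq.gz BaseCall/sample1_R2_001.fastq.gz \
  trimmed/sample1_R1.paired.fq.gz trimmed/sample1_R1.unpaired.fq.gz \
  trimmed/sample1_R2.paired.fq.gz trimmed/sample1_R2.unpaired.fq.gz \
  ILLUMINACLIP:${adapter}:2:30:10

# Index the reference genome (once), map the trimmed reads, and sort to BAM
bwa index ${mapping_db}
bwa mem ${mapping_db} trimmed/sample1_R1.paired.fq.gz trimmed/sample1_R2.paired.fq.gz \
  | samtools sort -o aligned/sample1.bam -

# Call SNPs and run the population analysis with Stacks
ref_map.pl --samples aligned --popmap popmap.txt -o stacks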

Eventually, the statistical analysis is carried out by statistics.sh using the above files. The generated result files are kept in the folder ./statistics, which should look like the following.

statistics/
├── Consensus
│   ├── chr.txt
│   └── consensus.txt
├── Coverage
│   ├── sample1.txt
│   └── sample2.txt
├── Read_depth
│   ├── sample1.txt
│   └── sample2.txt
├── SNP_Depth_0.jpg
├── SNP_Depth_10.jpg
├── SNPs_Distribution_0.svg
├── SNPs_Distribution_10.svg
└── results.txt

Basically, all the statistical results and figures we need are in this folder.

2. Statistical results preview

Currently, the following results are available.

  • Data in table 1
    • Reads Count of Raw Data
    • Reads Count of Trimmed Data
    • Mapped Length of Trimmed Data
    • Mapped Coverage of Trimmed Data
    • Average Depth of Mapped Reads
    • Consensus Length
    • Consensus Coverage
  • Data in table 2
    • SNPs Count
    • Average SNP Depth
    • Mapped Length / SNPs
    • Consensus Length / SNPs
  • Figures
    • SNPs Distribution Based on Chromosomes
      • Python - SNP distribution
      • R - SNP depth and distribution

All the numerical data are saved in ./statistics/results.txt. It is recommended to open the file with Excel or any software that supports .tsv format.
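For a quick look from the shell (an optional convenience, not part of the pipeline), the tab-separated columns can be aligned in the terminal:

# View results.txt with its tab-separated columns aligned
column -t -s $'\t' ./statistics/results.txt | less -S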

3. About numerical data

File ./statistics/results.txt contains two independent tables.

  • Table 1 contains information from Reads Count of Raw Data to Consensus Coverage.
  • Table 2 contains information from SNPs Count to Consensus Length / SNPs.

4. About figures

SNPs Distribution Figure

SNPs Distribution Figure shows the distribution of SNPs on the chromosomes using reference genome information.

  • File ./statistics/SNPs_Distribution_0.svg shows the distribution of all SNPs from ./stacks/populations.snps.vcf.
  • File ./statistics/SNPs_Distribution_10.svg shows the distribution of SNPs whose depth is >= 10 in the first two samples.

SNP Depth Figure

SNP Depth Figure shows the depth of each SNP on the chromosomes.

  • File ./statistics/SNP_Depth_0.jpg shows the depth of all SNPs from ./stacks/populations.snps.vcf.
  • File ./statistics/SNP_Depth_10.jpg shows the depth of SNPs whose depth is >= 10 in the first two samples.
