TF-Chan-Lab/miRDeep2_pipeline

miRDeep2_pipeline

This is a general pipeline for miRNA prediction and differential expression analysis using small RNA sequencing (sRNA-seq) data from plant samples. miRNAs are predicted from the sRNA-seq data with miRDeep2 (https://github.com/rajewsky-lab/mirdeep2). The miRDeep2 output is parsed by the parse_miRDeep2_prediction.pl script in this pipeline, and the parsed predicted miRNA sequences are searched against known miRNA sequences from the public domain for annotation. For differential expression analysis, the sRNA-seq reads are mapped to the predicted miRNA sequences using Bowtie (https://bowtie-bio.sourceforge.net/index.shtml) and a read count is generated for each miRNA sequence. Finally, DESeq2 (https://bioconductor.org/packages/release/bioc/html/DESeq2.html) is used on the miRNA count data for filtering, normalization and differential expression analysis. To increase sensitivity, it is recommended to merge the predicted miRNA sequences with those present in the public domain, such as miRBase, before differential expression analysis.

The details of using this pipeline are described below. The pipeline assumes that all input files and the pipeline directory are located in the same directory, and that all tools can be called directly from the command line.

Computational environment

Installation of the pipeline

git clone https://github.com/alanlamsiu/miRDeep2_pipeline.git

Files

  • Clean sRNA-seq data in .fq format for the different samples, named sample1.fq, sample2.fq, etc.

    • The clean sRNA-seq data are usually generated from the raw data by adapter trimming with a suitable tool, e.g. Trimmomatic (https://www.usadellab.org/cms/?page=trimmomatic). This pipeline assumes that the input sRNA-seq data are already adapter-trimmed and clean.
  • Reference genome sequence in .fa format, named ref.fa

    • Index file of the reference genome sequence, generated by Bowtie using the following command

    bowtie-build ref.fa ref.fa

  • Known miRNA sequences from the public domain in .fa format, e.g. mature.fa from miRBase (https://www.mirbase.org/ftp.shtml)

    • If sequences from miRBase are used, the mature.fa file needs to be preprocessed as follows.

      • To convert all "U" nucleotides to "T"

      perl ./miRDeep2_pipeline/script/fasta_U2T.pl mature.fa mature_wo_U.fa

      • To collapse mature_wo_U.fa into unique mature sequences

      perl ./miRDeep2_pipeline/script/unique_fasta_v1.2.pl mature_wo_U.fa mature_wo_U_uniq.fa mature_uniq

    • Index file of the known miRNA sequences, e.g. the miRBase sequences, generated by BLAST using the following command

    makeblastdb -in mature_wo_U_uniq.fa -dbtype 'nucl'
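As an optional sanity check after the preprocessing above, the following shell sketch counts sequence lines that still contain "U" and sequences that occur more than once. The two function names are hypothetical and not part of this pipeline. The sketch assumes that every sequence sits on a single line, as is the case in mature.fa from miRBase.

```shell
# Hypothetical helpers, not part of miRDeep2_pipeline.
# Assumption: each entry is one header line followed by one sequence line.

# Print the number of sequence lines that still contain U or u.
count_u_lines() {
  awk '!/^>/ && /[Uu]/ { n++ } END { print n + 0 }' "$1"
}

# Print the number of sequence lines that repeat an earlier sequence.
count_duplicate_seqs() {
  awk '!/^>/ && NF { if (seen[toupper($0)]++) d++ } END { print d + 0 }' "$1"
}
```

If the preprocessing behaved as described, both count_u_lines mature_wo_U_uniq.fa and count_duplicate_seqs mature_wo_U_uniq.fa would be expected to print 0.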

1. miRNA prediction using miRDeep2

Please follow the instructions at https://github.com/rajewsky-lab/mirdeep2 to run miRDeep2. It is not recommended to supply fasta files of known miRNAs when running miRDeep2, as the miRDeep2 results will be compared with known miRNAs for annotation in the next step. The results from miRDeep2 can be parsed as follows.

perl ./miRDeep2_pipeline/script/parse_miRDeep2_prediction.pl miRDeep2_predictions_list.txt miRDeep2

The file miRDeep2_predictions_list.txt contains the file names of the _predictions files, e.g. sample1_predictions, to be analyzed together, with one file name per line. Three output files are generated, named miRDeep2_mature.fa, miRDeep2_pre.fa and miRDeep2_pre.gff, with miRDeep2 as the prefix. The sequence IDs in miRDeep2_mature.fa and miRDeep2_pre.fa always begin with the prefix miRDeep2_mature_ and miRDeep2_precursor_, respectively, followed by the order number of the unique sequence. The genomic origin, precursor and primary miRNA sequence information can be found in the description field of each sequence entry in miRDeep2_mature.fa. The positional information of the mature and star sequences can be found in the description field of each sequence entry in miRDeep2_pre.fa. The miRDeep2_pre.gff file is a GFF representation of the description field in miRDeep2_pre.fa.
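The list file can be written by hand. As a convenience, a minimal shell sketch for generating it is given below. The function name is hypothetical and not part of this pipeline, and it assumes that all _predictions files are in the current directory, in line with the directory layout assumed above.

```shell
# Hypothetical helper, not part of miRDeep2_pipeline.
# Write the name of every *_predictions file in the current directory,
# one per line, to the list file given as the first argument.
make_predictions_list() {
  for f in *_predictions; do
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done > "$1"
}
```

For example, make_predictions_list miRDeep2_predictions_list.txt produces a list that can be passed to parse_miRDeep2_prediction.pl.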

2. Annotation of miRNA

After obtaining the predicted miRNA sequences, it is informative to annotate them with sequences already deposited in the public domain, e.g. miRBase. In this pipeline, the predicted miRNA sequences are first searched against the database of interest with BLAST using the following commands.

blastn -db mature_wo_U_uniq.fa -query miRDeep2_mature.fa -out miRDeep2_mature_blastn.txt -word_size 4 -num_alignments 1

perl ./miRDeep2_pipeline/script/general_blast_parser.pl miRDeep2_mature_blastn.txt miRDeep2_mature_blastn_parsed.txt

perl ./miRDeep2_pipeline/script/parse_parsed_blast_known.pl miRDeep2_mature_blastn_parsed.txt miRDeep2_mature

The search results are then parsed to identify known miRNAs. A predicted miRNA is considered known if it matches a sequence in the database of interest with hit coverage >= 0.9, no indels, an identical seed region (nucleotides 2 - 7) and no more than two mismatches.

The output file will contain all entries of known miRNAs.
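To make the criteria for a known miRNA concrete, they can be written as a single test. This is only an illustrative sketch and not the logic of parse_parsed_blast_known.pl. The function name and its arguments are hypothetical, and it assumes the query and hit sequences are aligned from their first nucleotide, so that positions 2 to 7 correspond to each other.

```shell
# Hypothetical helper, not part of miRDeep2_pipeline.
# Usage: is_known_hit COVERAGE GAPS MISMATCHES QUERY_SEQ HIT_SEQ
# Exit status 0 if all criteria are met, 1 otherwise.
is_known_hit() {
  awk -v cov="$1" -v gaps="$2" -v mm="$3" -v q="$4" -v s="$5" 'BEGIN {
    seed_q = substr(toupper(q), 2, 6)   # nucleotides 2 to 7 of the query
    seed_s = substr(toupper(s), 2, 6)   # nucleotides 2 to 7 of the hit
    ok = (cov >= 0.9 && gaps == 0 && mm <= 2 && seed_q == seed_s)
    exit (ok ? 0 : 1)
  }'
}
```

For example, is_known_hit 1.0 0 1 TGAGGTAGTAGGTTGTATAGTT TGAGGTAGTAGGTTGTATAGTA succeeds, because the single mismatch lies outside the seed region.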

3. Differential expression analysis

To perform differential expression analysis, the clean reads are first mapped to the reference miRNA sequences. In step 1, a miRNA sequence file, miRDeep2_mature.fa, is generated. This file can be used as the reference for mapping. Alternatively, a combination of the sequences in miRDeep2_mature.fa and those present in the public domain, e.g. miRBase, but missed by miRDeep2 can also serve as the reference. In the following analysis, the file of reference miRNA sequences is named miRNA_ref.fa and is indexed using the following command.

bowtie-build miRNA_ref.fa miRNA_ref.fa
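If the combined reference is preferred, miRNA_ref.fa has to be assembled before indexing. One possible way is sketched below. It is an assumption, not a script from this pipeline: the function name is hypothetical, "missed by miRDeep2" is approximated as "no identical sequence in miRDeep2_mature.fa", and every sequence is assumed to sit on a single line. The known sequences should already be converted from "U" to "T".

```shell
# Hypothetical helper, not part of miRDeep2_pipeline.
# Usage: merge_refs PREDICTED_FA KNOWN_FA > miRNA_ref.fa
# Prints every entry of PREDICTED_FA, then the entries of KNOWN_FA
# whose sequence has not been seen yet.
merge_refs() {
  awk '
    !NF  { next }                 # skip blank lines
    /^>/ { header = $0; next }    # remember the current header
    {
      seq = toupper($0)
      if (!(seq in seen)) {       # keep only the first copy of a sequence
        seen[seq] = 1
        print header
        print $0
      }
    }
  ' "$1" "$2"
}
```

For example, merge_refs miRDeep2_mature.fa mature_wo_U_uniq.fa > miRNA_ref.fa, followed by the bowtie-build command above.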

Then each sample, e.g. sample1.fq, is mapped to the reference miRNA sequences using the following command.

bowtie -v 0 --norc -S miRNA_ref.fa sample1.fq | samtools view -Sb - > sample1.bam
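With several samples, the same command can be wrapped in a loop. The sketch below is a convenience wrapper under the assumption that all read files end in .fq. The function name is hypothetical and not part of this pipeline. The bowtie and samtools options are exactly those of the command above.

```shell
# Hypothetical helper, not part of miRDeep2_pipeline.
# Usage: map_samples REFERENCE_INDEX SAMPLE1.fq [SAMPLE2.fq ...]
# Writes SAMPLE.bam next to each SAMPLE.fq.
map_samples() {
  ref=$1
  shift
  for fq in "$@"; do
    sample=${fq%.fq}    # strip the .fq extension to get the sample name
    bowtie -v 0 --norc -S "$ref" "$fq" | samtools view -Sb - > "$sample.bam"
  done
}
```

For example, map_samples miRNA_ref.fa sample1.fq sample2.fq produces sample1.bam and sample2.bam.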

With the alignment .bam file, the number of reads perfectly mapped to each reference miRNA sequence is generated using the following command.

perl ./miRDeep2_pipeline/script/bam2ref_counts.pl -bam sample1.bam -f miRNA_ref.fa > sample1_count.txt

To combine the read count data of all samples into a single table, the following command is used.

perl ./miRDeep2_pipeline/script/combine_htseq_counts.pl count_list.txt count_table.txt

The file count_list.txt contains the file names of the count files, e.g. sample1_count.txt, to be combined and analyzed together, with one file name per line. In the output file, count_table.txt, the last four columns represent the mean, median, variance and coefficient of variation. These four columns should be removed from count_table.txt prior to differential expression analysis. The count table is then ready for normalization and differential expression analysis using DESeq2. It is recommended to follow the Beginner's guide to using the DESeq2 package (https://bioc.ism.ac.jp/packages/2.14/bioc/vignettes/DESeq2/inst/doc/beginner.pdf).
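The removal of the last four columns can be done in any spreadsheet or scripting tool. A minimal shell sketch is given below. It assumes that count_table.txt is tab-delimited, which should be checked against the actual file. The function name and the output file name in the example are hypothetical.

```shell
# Hypothetical helper, not part of miRDeep2_pipeline.
# Assumption: the count table is tab-delimited.
# Usage: drop_last_four_columns count_table.txt > trimmed_table.txt
drop_last_four_columns() {
  awk 'BEGIN { FS = OFS = "\t" }
       {
         keep = NF - 4                  # number of columns to retain
         line = $1
         for (i = 2; i <= keep; i++) line = line OFS $i
         print line
       }' "$1"
}
```

For example, drop_last_four_columns count_table.txt > count_table_trimmed.txt writes a table that contains only the sequence IDs and the per-sample counts.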
