Inspired by SciLifeLab/NGI-smRNAseq.
Here we take a simple approach to map small RNA sequencing reads to miRNA hairpins downloaded from miRBase.
The following command line parameters can be used to customise running of the pipeline:
`--reads`
... provide the read files to use as input. Wildcards are possible, but need to be provided in single quotation marks [Required]
`--genome`
... specify the genome (or version) to use. Currently only compatible with `dm6` [Required]
`--mismatches`
... define the number of mismatches to use (Possible values: 0-3, Default: 3)
`--adapter`
... Adapter sequence to use for clipping (Default: Illumina TruSeq 3' adapter + index primer: `AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG`)
`--norand`
... Use this to prevent trimming of `NNNN` from both 5' and 3' end of reads
`--annoreads`
... Sample file, providing columns `idx`, `sRNAreads`, `miRNAreads`, `time` for each sample [Required]
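As an illustration of what `--norand` switches off, the following minimal Python sketch removes a fixed number of bases from both ends of an already adapter-clipped read. It is not the pipeline's own code: the function name `trim_random_ends` is hypothetical, and the default of four bases per end is an assumption read off the `NNNN` in the parameter description.

```python
def trim_random_ends(seq: str, n: int = 4, norand: bool = False) -> str:
    """Remove n bases from both the 5' and 3' end of a read.

    n=4 is an assumption based on the NNNN in the --norand description.
    With norand=True the read is returned unchanged.
    """
    if norand:
        return seq
    if len(seq) <= 2 * n:
        # Nothing would be left after trimming both ends.
        return ""
    return seq[n:len(seq) - n]


if __name__ == "__main__":
    read = "ACGT" + "TGAGGTAGTAGGTTGTATAGTT" + "ACGT"
    print(trim_random_ends(read))               # inner sequence only
    print(trim_random_ends(read, norand=True))  # read left untouched
```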
The pipeline assumes a certain file format: all input files are labelled `IDX_filename.fq` (`.gz` files are handled by the pipeline transparently). The `IDX` needs to be separated from the remaining file name by an `_` (underscore) and is expected in the `idx` column of the sample file passed through `--annoreads`.
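The naming convention above can be checked before starting a run. The following is a minimal Python sketch, not part of the pipeline; the helper name `idx_from_filename` is hypothetical, and it assumes that only the `.fq` and `.fq.gz` extensions are valid.

```python
from pathlib import Path


def idx_from_filename(path: str) -> str:
    """Return the IDX prefix of an input file named IDX_filename.fq[.gz]."""
    name = Path(path).name
    # Assumption: only .fq and .fq.gz are accepted as input extensions.
    if not (name.endswith(".fq") or name.endswith(".fq.gz")):
        raise ValueError(f"unexpected extension: {name}")
    # Everything before the first underscore is treated as the IDX.
    idx, sep, rest = name.partition("_")
    if not sep or not idx or not rest:
        raise ValueError(f"expected IDX_filename.fq, got: {name}")
    return idx


if __name__ == "__main__":
    for f in ["1234_sample1_0h.fq", "/data/1235_sample2_1h.fq.gz"]:
        print(f, "->", idx_from_filename(f))
```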
The sample file requires the following columns (in no specific order): `idx`, `sRNAreads`, `miRNAreads`, `time`. To identify each sample uniquely, `idx` needs to have a unique value and must match the `IDX` section of the input file names. The values for `sRNAreads` and `miRNAreads` can be pre-calculated by external tools. The `time` column is helpful for downstream analysis of time course experiments, but is currently not evaluated within this pipeline.
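The requirements on the sample file can be sketched as a small validation step in Python. This is only an illustration under stated assumptions: the on-disk format of the sample file is not specified here, so a tab-separated file with a header row is assumed, and the function names `load_sample_sheet` and `unmatched_inputs` are hypothetical.

```python
import csv

REQUIRED = {"idx", "sRNAreads", "miRNAreads", "time"}


def load_sample_sheet(handle, delimiter="\t"):
    """Read the sample file into a dict keyed by idx.

    Assumption: tab-separated text with a header row. Columns may appear
    in any order, but all four required columns must be present.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    samples = {}
    for row in reader:
        idx = row["idx"]
        if idx in samples:
            raise ValueError(f"duplicate idx: {idx}")
        samples[idx] = row
    return samples


def unmatched_inputs(samples, filenames):
    """Return IDX prefixes of input file names that have no sample row."""
    prefixes = {name.split("_", 1)[0] for name in filenames}
    return sorted(prefixes - set(samples))
```

Calling `unmatched_inputs` with the loaded sheet and the list of input file names returns an empty list when every file has a matching `idx` entry.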
Currently, for fly data, we map our small RNA reads to the genome, perform hierarchical assignment of their annotation class (e.g. rRNA, tRNA, mitochondrial, snoRNA, snRNA, miRNA, piRNA, exonic, intronic, TEs) and then use the sum of all small RNA mapping reads as the normalisation factor.
Currently not used, but provided for optional normalisation in downstream exploratory analyses.
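The hierarchical assignment and read sum described above for fly data could look roughly like the following Python sketch. It is not the annotation procedure actually used: the priority order is an assumption taken from the order in which the classes are listed, the choice of which classes count towards the small RNA total is left to the caller, and all names (`PRIORITY`, `assign_class`, `tally`, `srna_total`) are hypothetical.

```python
from collections import Counter

# Assumption: priority follows the order in which the classes are listed.
PRIORITY = [
    "rRNA", "tRNA", "mitochondrial", "snoRNA", "snRNA",
    "miRNA", "piRNA", "exonic", "intronic", "TE",
]


def assign_class(overlapping):
    """Return the highest-priority class among those a read overlaps."""
    for cls in PRIORITY:
        if cls in overlapping:
            return cls
    return "unannotated"


def tally(reads):
    """Count reads per assigned class.

    reads: iterable of sets, each holding the annotation classes that
    one mapped read overlaps.
    """
    return Counter(assign_class(r) for r in reads)


def srna_total(counts, small_rna_classes):
    """Sum the counts of the classes the caller treats as small RNA."""
    return sum(counts[c] for c in small_rna_classes)


if __name__ == "__main__":
    reads = [{"miRNA", "intronic"}, {"rRNA", "miRNA"}, {"piRNA"}, set()]
    counts = tally(reads)
    print(dict(counts))
    print(srna_total(counts, ["miRNA", "piRNA"]))
```

Each read is counted exactly once, under the first class in the priority list that it overlaps, so the per-class counts always add up to the number of mapped reads.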
## Example file formatting
Given three samples from a time course, taken after `0`, `1` and `2` hours of treatment, the sequencing data could be labelled as follows:
Input files: `1234_sample1_0h.fq`, `1235_sample2_1h.fq`, `1236_sample3_2h.fq`
After mapping and categorising all small RNA and miRNA mapping reads, we could have the following summary table:
filename | sRNAreads | miRNAreads |
---|---|---|
1234_sample1_0h.fq | 10239489 | 8172363 |
1235_sample2_1h.fq | 14864299 | 10236982 |
1236_sample3_2h.fq | 11928376 | 9283726 |
The manually created sample file, passed after the `--annoreads` parameter, should therefore be formatted as follows:
idx | sRNAreads | miRNAreads | time |
---|---|---|---|
1234 | 10239489 | 8172363 | 0 |
1235 | 14864299 | 10236982 | 1 |
1236 | 11928376 | 9283726 | 2 |
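The step from the summary table to the sample file can also be scripted instead of done by hand. The following Python sketch is one possible way, under assumptions: the sample file is written as tab-separated text, the time points are supplied explicitly per `idx` rather than parsed from the file names, and the function names `sample_rows` and `write_sheet` are hypothetical.

```python
import csv
import sys

COLUMNS = ["idx", "sRNAreads", "miRNAreads", "time"]


def sample_rows(summary, times):
    """Turn summary rows into sample-file rows.

    summary: list of (filename, sRNAreads, miRNAreads) tuples.
    times:   dict mapping idx to the time point of that sample.
    """
    rows = []
    for filename, srna, mirna in summary:
        # The IDX is the part of the file name before the first underscore.
        idx = filename.split("_", 1)[0]
        rows.append({"idx": idx, "sRNAreads": srna,
                     "miRNAreads": mirna, "time": times[idx]})
    return rows


def write_sheet(rows, handle, delimiter="\t"):
    """Write rows with a header line (assumed tab-separated format)."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                            delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    summary = [
        ("1234_sample1_0h.fq", 10239489, 8172363),
        ("1235_sample2_1h.fq", 14864299, 10236982),
        ("1236_sample3_2h.fq", 11928376, 9283726),
    ]
    times = {"1234": 0, "1235": 1, "1236": 2}
    write_sheet(sample_rows(summary, times), sys.stdout)
```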
- Compatibility
  - Add other genomes (specifically: mouse & human)
- Preprocessing
  - Nextflow sequential annotation pipeline
  - ribosomal DNA
  - Mitochondrial genome
  - genome unique, then multimappers
  - pre-miRs unique, multimappers
  - Quantify from above -- extract smallRNA reads for normalisation
  - Extract smallRNA mappers from custom bash-AnnotationPipeline?
  - Nextflow sequential annotation pipeline
- Analysis/Reporting
  - Tally all mutations in boxplot
  - Combining R analysis and nextflow
- Documentation
  - Installation of the pipeline
  - Adding custom genomes
- Usability
  - Unit tests?