The MARIO (Measurement of Allelic Ratio Informatics Operator) pipeline was designed to identify Allele-Dependent Behavior (ADB) within a sequencing experiment at heterozygous positions identified through genotyping data.
The pipeline's flexible design allows for multiple uses, including downloading SRA files from NCBI, quality control on FASTQ files, aligning to a genome using three different aligners, etc. (detailed in diagram below).
The "Usage examples" section, below, demonstrates commands for various use cases.
For the associated publication, see "Transcription factors operate across disease loci, with EBNA2 implicated in autoimmunity," Nature Genetics, 2018, by Harley JB, et al. Citation information for the MARIO pipeline software itself can be found below.
- fastqc
- fastq-dump (part of the NCBI SRA Toolkit)
- hisat2
- bowtie2
- STAR ~ 2.5.x
- samtools
- bedtools
- picard-tools MarkDuplicates (part of Picard)
- macs2 (also installable from PyPI)
The MARIO pipeline also depends on a locally-modified version of MOODS (using base-2 statistics), included in this repository.
MARIO is written in Perl and has a single third-party CPAN dependency, Parallel::ForkManager, which might already be installed by your sysadmin. If not, you could ask nicely for it to be installed, or refer to the "Third-party Perl modules" section below for instructions on how to do it yourself.
You may download the latest release as a compressed archive from GitHub, or clone the repository with Git:
# GitHub
git clone https://github.com/WeirauchLab/MARIO.git
# Weirauch Lab GitLab
git clone https://tfinternal.research.cchmc.org/gitlab/puj6ug/MARIO_pipeline.git
Then test your installation by running ./MARIO -h from within the cloned repo or expanded archive. If you receive an error about missing Perl modules, see the next section.
The tool will automatically verify the presence of required external tools (listed above) based on the mode of operation.
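If you would like to check your PATH before running the pipeline, a minimal sketch along these lines may help. It is not MARIO's own check: the function name missing_tools is hypothetical, and the executable names are assumptions taken from the dependency list above (Picard is left out because its launcher name varies between installations).

```shell
# Hypothetical helper, not part of MARIO: print each named tool
# that cannot be found on the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names assumed from the dependency list; adjust to your setup.
missing_tools fastqc fastq-dump hisat2 bowtie2 STAR samtools bedtools macs2
```

An empty result means every listed tool was found. Only the tools needed for your chosen mode of operation have to be present.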
The recommended way to install the required third-party CPAN module as a non-root user is to set up local::lib, then type
cpan Parallel::ForkManager
Newer versions of CPAN.pm support local::lib internally, and make the necessary changes to your ~/.bashrc during initial setup.
On systems with older (1.x) versions of CPAN, where local::lib is available (try perl -Mlocal::lib), perform these steps:
- add this line to your shell's rcfile (e.g., ~/.bashrc):
  eval `perl -Mlocal::lib`
- re-source your shell's rcfile (or quit and re-open your terminal session), then install the package with CPAN:
  source ~/.bashrc
  cpan Parallel::ForkManager
If your system does not have local::lib available, you could ask your sysadmin to install it globally, or else follow the bootstrapping instructions in the local::lib documentation.
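To confirm whether the module is already visible to Perl, one option is to try loading it from the command line. The sketch below wraps that in a small function; the name has_perl_module is hypothetical, and the approach relies only on Perl's standard -M and -e switches.

```shell
# Hypothetical helper: succeed only if Perl can load the named module.
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if has_perl_module Parallel::ForkManager; then
    echo "Parallel::ForkManager is installed"
else
    echo "Parallel::ForkManager is missing"
fi
```

If the module was installed through local::lib, run this in a shell that has already sourced the updated rcfile, otherwise Perl may not search the local library path.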
- this step-by-step local::lib tutorial on the Mojolicious wiki
- if you get the error message mkdir /root/.cpan - permission denied when attempting to install the package with CPAN, refer to this post on perlmonks.org
Hint: running MARIO -h produces a help screen, which you can then pipe through less.
MARIO -I SRR1608989 -C config_3.4.0.txt
MARIO -I SRR1608989.sra -C config_3.4.0.txt -uX path_to_BOWTIE2_aligner_index_files/hg19
MARIO -I SRR1608989.sra -C config_3.4.0.txt -sX path_to_STAR_aligner_index_files
MARIO -I SRR1608989.sra -C config_3.4.0.txt -tX path_to_HISAT2_aligner_index_files/hg19
MARIO -F SRR1_1.fq.gz:SRR1_2.fq.gz -C config_3.4.0.txt -sX \
path_to_STAR_aligner_index_files
MARIO -F SRR1_1.fq.gz:SRR1_2.fq.gz,SRR2.fq.gz -C config_3.4.0.txt -sX \
path_to_STAR_aligner_index_files
MARIO -cA SRR1.bam,SRR2.bam -C config_3.4.0.txt
MARIO -dA SRR1.bam,SRR2.bam -C config_3.4.0.txt -G path_to_genotyping_file/hetpos.txt
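The -F values in the examples above combine two separators. As a rough illustration of how such a value can be read, the sketch below splits it into samples and mates. It is not MARIO's parser: the function name describe_fastq_spec is hypothetical, and treating a comma as the separator between samples is an assumption inferred from the examples (only the ":" separator for paired-end mates is stated in the help text).

```shell
# Hypothetical helper, not MARIO's own parsing code.
# Assumes commas separate samples and ":" separates paired-end mates.
describe_fastq_spec() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r sample; do
        case "$sample" in
            # A colon marks a paired-end sample: mate 1, then mate 2.
            *:*) printf 'paired\t%s\t%s\n' "${sample%%:*}" "${sample#*:}" ;;
            # Otherwise the sample is treated as single-end.
            *)   printf 'single\t%s\n' "$sample" ;;
        esac
    done
}

describe_fastq_spec 'SRR1_1.fq.gz:SRR1_2.fq.gz,SRR2.fq.gz'
```

For the value shown, this prints one tab-separated "paired" line for the SRR1 mates and one "single" line for SRR2.fq.gz.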
+-------------------------------------------+
| |
| +-----[I] +-----[B] +-----[G] |
| | SRAID | | BED | ----> | GEN | |
| +-------+ +-------+ (b) +-------+ |
| | ^ | |
| | | (c) | |
| v | v |
| +-----[F] +-----[A] +-----[D] | +=======+ +-----[C]
| | FASTQ | ----> | BAM | ----> | DAT | | ----> | ADB | <---- | ANNOT |
| +-------+ (a) +-------+ (d) +-------+ | +=======+ (n) +-------+
| (q) ^ | |
| | | |
| | | v
| +-----[X] | +=======+ +-----[M]
| | INDEX | | | HIT | <---- | MOTIF |
| +-------+ | +=======+ +-------+
| |
+-------------------------------------------+
BASIC FUNCTIONS ALLELE-DEPENDENT FUNCTIONS
-I SRA ID (i.e. SRR1608989 )
-F Fastq file (paired-end reads should be separated with \":\", like: FQ1:FQ2)
-A Alignment file (BAM format)
-D DAT file (first output of the MARIO pipeline containing raw allelic counts)
Priority of input files:
If multiple input files are provided (e.g.: SRA_ID, FASTQ and BAM files),
the pipeline starts with the file with the highest priority.
I<S<F<A<D (the DAT file has the highest priority)
-G Genotyping file with heterozygous positions
-X Index files for corresponding aligner (STAR, HISAT2 and BOWTIE2 supported)
For BOWTIE2, add to the end of the index path the base name common
to all the .bt2 files, like /path_to_index_files/hg19
For HISAT2, add to the end of the index path the base name common
to all the .ht2 files, like /path_to_index_files/hg19
For STAR, nothing needs to be added to the index path
-C Configuration file (can be generated with the -y option)
-M (optional) File with a list of motifs (PWMs)
-a Align FASTQ reads to the genome (generates BAM file)
-d Find positions with ADBs (allele-dependent behavior)
-B (optional) Peaks file in BED format
-O Name of output folder (all files are saved here)
-c Call peaks
-b Do not require het-SNPs to fall within peaks
-i Integrate or concatenate multiple FASTQ files (input with -I or -F options)
Sometimes a single FASTQ file is split into multiple ones in GEO for a
single experiment
-n Annotate ADB results
It will use the GENANNO_FILE and/or DISANNO_FILE specified in config file
-p Number of threads (default: use all available threads)
-q Perform quality control on FASTQ files
-r Keep duplicate reads
-u Perform alignment with BOWTIE2 (suitable for ChIP-seq)
-s Perform alignment with STAR (suitable for RNA-seq)
-t Perform alignment with HISAT2 (suitable for RNA-seq)
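The input priority rule (I<S<F<A<D) can be pictured with a small sketch. This is an illustration only, not the pipeline's actual logic: the function name highest_priority_input is hypothetical, and the S entry is carried over from the priority string as written, since its meaning is not described in the option list.

```shell
# Hypothetical illustration of the documented priority order I<S<F<A<D.
# Arguments: the input types that were supplied (any of I S F A D).
highest_priority_input() {
    # Walk the order from highest to lowest and report the first match.
    for candidate in D A F S I; do
        case " $* " in
            *" $candidate "*) printf '%s\n' "$candidate"; return 0 ;;
        esac
    done
    return 1   # nothing recognised was supplied
}

highest_priority_input I F A   # prints A
```

Under this reading, supplying an SRA ID, a FASTQ file and a BAM file together would start the run from the BAM file.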
(BED) If the -c option is given, MACS2 called peaks are produced as a BED file.
The BED file has 6 additional columns (6 through 11):
6. Number of reads under the peak
7. Peak width
8. RPKM, measured as the number of reads divided by the peak width,
multiplied by 1,000,000 divided by the total number of reads under
all peaks
9. TIER1 flag. If 1, the peak passed the minimum peak RPKM requirement
10. TIER2 flag. If 1, the peak passed the minimum peak width requirement
11. TIER3 flag. If 1, the peak passed the minimum peak reads requirement
(ADB) Allele-dependent behavior at each heterozygous position, including
reproducibility score (ARS) and annotations.
(HIT) If motif files are given, the ADB file is further annotated with motif
hits on each heterozygous position.
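The RPKM value described for column 8 of the peak BED file can be reproduced by hand. The sketch below follows the wording of that description literally; it is not taken from the MARIO source, the function name peak_rpkm is hypothetical, and the unit of the peak width is an assumption (whatever unit the width column uses).

```shell
# Hypothetical helper following the column 8 description:
# (reads under the peak / peak width) * 1,000,000 / total reads under all peaks.
# Arguments: reads, width, total.
peak_rpkm() {
    awk -v reads="$1" -v width="$2" -v total="$3" \
        'BEGIN { printf "%.4f\n", (reads / width) * 1000000 / total }'
}

peak_rpkm 50 200 1000000   # prints 0.2500
```

Here a peak with 50 reads and a width of 200, in an experiment with 1,000,000 reads under all peaks, scores 0.25.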
- BugFix. Failed producing multiple BED files correctly
- Peak BED files now report entire MACS2 output (for compatibility with IDR calculations) plus RPKMs and number of reads under peaks
- Expanded trimming capabilities: Trims adapter sequences if QC on reads fails on "Per base sequence quality" and "Per sequence quality scores"
- Bugfix. Trimming failed when having paired-end reads due to file naming issues
- Bugfix. BAM sorting failed due to a changed option in samtools (can only use samtools version 1.3 or higher now)
- Updated README.md
- Create CONFIG.txt instead of config_3.6.txt
- Better organization of the CONFIG.txt file
- Added multiple ways of calling peaks with MACS2
- Added MD5 checksum check on downloaded SRA file
- Create config_3.6.txt instead of config_3.6.2.txt
- Help output now includes description of -n option
- BugFix. BAM not sorted if dup reads not removed
- Improved output to screen for QC analysis
- Bugfix: temp sorted BAM files were not stored in correct directory
- BugFix. Path removed in file base name under -i option
- Bugfix. Bug was introduced with the -i option
- Added -i option to integrate or concatenate multiple FASTQ files into one
- Downloads SRA files via FTP site using wget. Previously they were downloaded via fastq-dump, which was too slow
- Added functionality. Trims adapter sequences if QC on reads fails on "Adapter Content"
- Updated README.md
- Bug fix. The program fastqc was hard-coded
- Bug fix. Couldn't do QC on fastq files alone
- Bug fix. Code not stopping if peak calls failed
- Bug fix. Fixed FASTQ file naming issues
- Corrections made to README.md
- Fixed inconsistencies in README.md
- It can now annotate the DAT file with multiple arbitrary bed files
- Fixed input logic problems
- Enhanced data input. It creates a context-based environment, meaning that it requires only the minimal set of inputs, depending on the requested operations
- You can now decide whether to annotate ADB file or not, using the -n option
- Added support for gzipped genotyping files
- Added aligning capabilities with the HISAT2 aligner (not recommended to use with masked genomes)
- Added quality control of FASTQ files through the -q option
- Added use of BED file to generate a fake het-SNPs file spanning all positions in the BED file. This behavior is triggered if no genotyping file is given or the option -g is provided
- It now uses "bedtools closest" to annotate positions with disease SNPs and genes
- Major rearrangement of the logic of the program; now has more control on the provided inputs and outputs
Mario Pujato, The MARIO Pipeline, (2018), GitHub repository, https://github.com/WeirauchLab/MARIO
Transcription factors operate across disease loci, with EBNA2 implicated in autoimmunity.
Harley JB, Chen X, Pujato M, Miller D, Maddox A, Forney C, Magnusen AF, Lynch A, Chetal K, Yukawa M, Barski A, Salomonis N, Kaufman KM, Kottyan LC, Weirauch MT.
Nat Genet. 2018 Apr 16. doi: 10.1038/s41588-018-0102-3
PMID: 29662164
Please report any issues with the MARIO pipeline (or feature suggestions) in our GitHub issue tracker.
With other questions, you may contact Dr. Matthew Weirauch via email.
Name | Institution | Remarks |
---|---|---|
Dr. Mario Pujato | Cincinnati Children's Hospital | primary author |