-
Notifications
You must be signed in to change notification settings - Fork 1
Home
Welcome to the CPU wiki!
ChIA-PET Utilities is a collection of efficient specialized programs for processing ChIA-PET data from raw reads to interactions. Essentially, CPU addresses the following tasks:
- trimming sequencing adaptors from paired reads
- identifying the linker(s) to classify the PETs as intra-/inter-molecular
- genome mapping using BWA aln and mem algorithm according to the tag length
- de-duplicating PETs arising from clonal PCR amplification
- clustering of PETs into interactions
- determining the statistical significance of interactions via ChiaSigScaled
If you will like to get a more detail overview (1-5), check out my presentation "ChIA-PET Utilities -- from Reads to Interactions" delivered at the ChIA-PET Workshop 2017.
BWA release to construct the genome index for read mapping.
We have run CPU successfully on Linux (Debian, SUSE, CentOS, Ubuntu) and Mac (El Capitan v10.11.6) system .
C compiler such as GNU C Compiler (gcc).
To install CPU:
tar -zxvf CPU-0.0.1a-r1.tar.gz
# create static compression library
cd CPU-0.0.1a-r1/source/zlib-1.2.8
./configure
make clean
make
cd ../../..
# takes less than a minute
cd CPU-0.0.1a-r1/source
make clean
ln -s zlib-1.2.8/libz.a .
make
./cpu
# takes less than a minute
Due to github constraint on file size, the data in the demo folder is for one to familiarize with / exercise the workflow rather than to extract biological insights. Once you have tried this workflow, you could then work with ChIA-PET data downloaded from public repository.
You can re-use your BWA index for mm10 build with BWA v0.7.10+. Otherwise, you may build it with the provided script as follow:
# CREDIT: Script adapted from Bowtie distribution.
# build genome index
./buildmm10.sh 2>&1 | tee buildmm10.log
To process the demo data "EED_chiapet_rep6.1M.R1.fastq.gz" and "EED_chiapet_rep6.1M.R2.fastq.gz", you may use the provided script as follow:
# process EED_chiapet_rep6.1M.R{1,2}.fastq.gz using 16 threads
./example.sh EED_chiapet_rep6.1M 16
Let's step throught the script "example.sh".
export RUN=${1}
export NTHREAD=${2}
export LOGFILE=${RUN}.cpu.log
This set up the variables based on the parameters passed to the script.
cpu tag trims sequencing adaptors from paired reads. Command tag will additionally identify the linker(s) to classify the PETs as intra-/inter-molecular.
# split tag in read
echo `date` STARTED ${RUN} cpu tags .. | tee -a ${LOGFILE}
./cpu tag -W -T 18 -t ${NTHREAD} -O ${RUN} ${RUN}.R1.fastq.gz ${RUN}.R2.fastq.gz 2>> ${LOGFILE}
echo `date` ENDED ${RUN} cpu tags .. | tee -a ${LOGFILE}
Output | Description |
---|---|
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.fastq (file of interest) |
intra-molecular PET with both tags; interaction signal |
EED_chiapet_rep6.1M.FullLinker.NonChimeric.single.fastq | intra-molecular PET with a single tag; the other tag is too short or unavailable |
EED_chiapet_rep6.1M.fulllinker.chimeric.paired.fastq | inter-molecular PET with both tags; consider random ligation |
EED_chiapet_rep6.1M.fulllinker.chimeric.single.fastq | inter-molecular PET with a single tag; consider random ligation |
EED_chiapet_rep6.1M.none.fastq | PET with no linker identified; can't tell if it is inter-/intra-molecular |
EED_chiapet_rep6.1M.halflinker.fastq | only half linker has been identified |
EED_chiapet_rep6.1M.tied.fastq | more than 1 full linkers identified with the same score |
EED_chiapet_rep6.1M.conflict.fastq | the identified linkers are discordant in its inter-/intra-molecular classification. |
We compressed the identified informative tags to save disk space.
# compress splitted tags
echo `date` 'STARTED Compressing FL NC paired tags ..' | tee -a ${LOGFILE}
gzip -f ${RUN}.FullLinker.NonChimeric.paired.fastq 2>>${LOGFILE}
echo `date` 'ENDED Compressing FL NC paired tags ..' | tee -a ${LOGFILE}
cpu memaln uses BWA aln and mem algorithm according to the tag length.
# mapping
echo `date` STARTED Mapping FL NC paired tags .. | tee -a ${LOGFILE}
./cpu memaln -T 15 -t ${NTHREAD} mm10 ${RUN}.FullLinker.NonChimeric.paired.fastq.gz 2>> ${LOGFILE} 1>${RUN}.FullLinker.NonChimeric.paired.sam
echo `date` ENDED Mapping FL NC paired tags .. | tee -a ${LOGFILE}
Again, we compress the alignment output (.sam) to save disk space.
echo `date` 'STARTED Compressing FL NC memaln sam file ..' | tee -a ${LOGFILE}
gzip ${RUN}.FullLinker.NonChimeric.paired.sam 2>&1 | tee -a ${LOGFILE}
echo `date` 'ENDED Compressing FL NC memaln sam file ..' | tee -a ${LOGFILE}
cpu pair considers the mapping results of the pair-tags that have been mapped independently and group the PETs broadly into (1) both tags originate from a unique genmoic location, (2) one of the tags from the pair-tags originates from a unique genomic location, (3) both tags from the pair-tags mapped to multiple genomic locations, and (4) one or both tags from the pair-tags do not map to the reference genome.
# pairing
echo `date` STARTED Pairing FL NC paired tags .. | tee -a ${LOGFILE}
./cpu pair -S -t ${NTHREAD} ${RUN}.FullLinker.NonChimeric.paired.sam.gz 1>${RUN}.FullLinker.NonChimeric.paired.stat.xls 2>> ${LOGFILE}
echo `date` ENDED Pairing FL NC paired tags .. | tee -a ${LOGFILE}
Output | Description |
---|---|
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UU.bam (file of interest) |
contains PETs which both tags mapped to a single/unique genomic location |
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UxxU.bam | contains PETs which only one of the pair-tags mapped to a single/unique genomic location |
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.xx.bam | contains PETs which none of the pair-tags mapped to a single/unique genomic location |
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.nn.bam | contains PETs which none of the pair-tags mapped to the reference genome |
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UU.failed.bam | contains PETs which both tags mapped to a single/unique genomic location BUT which one or both tags whose best mapping result does not distinguish very well from the second best mapping result(s) |
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UU.discordant.bam | contains PETs which both tags mapped to a single/unique genomic location BUT which one tag extractable from both Read/1 and Read/2 disagree on the uniquely mapped genomic location |
cpu dedup removes potential duplicates from PCR. cpu applies the standard pair-end reads deduplication consideration of 5' mapping coordinate of Read/1 and Read/2 on the pair-tags. Additionally, cpu consider the identified linker(s) and also the 3' end of tag if linker has been identified adjacent to the tag.
# deduplication
echo `date` STARATED De-duplicating FL NC paired tags UU .. | tee -a ${LOGFILE}
./cpu dedup -g -t ${NTHREAD} ${RUN}.FullLinker.NonChimeric.paired.UU.bam 1>${RUN}.FullLinker.NonChimeric.paired.UU.dedup.lc 2>> ${LOGFILE}
echo `date` ENDED De-duplicating FL NC paired tags UU .. | tee -a ${LOGFILE}
echo `date` removing transient file .. | tee -a ${LOGFILE}
rm ${RUN}.FullLinker.NonChimeric.paired.UU.dedup 2>> ${LOGFILE}
Output | Description |
---|---|
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UU.nr.bam (file of interest) |
contains PETs which are considered non-redundant |
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UU.dup.bam | contains PETs which are consider PCR duplicates |
cpu cluster collates the non-redundant uniquely-mapped pair-end tags to produce cis-interactions and trans-interactions both merging pair-end tags whose both anchors overlap or within x bases on the reference genome coordiates. (1000 in this example; from -5 and -3 options we have 980-(-20)=1000.)
# cluster tags
echo `date` STARTED ${RUN} cpu clustering.. | tee -a ${LOGFILE}
./cpu cluster -g -M -B 1000 -5 5,-20 -3 5,980 -t 1 -O ${RUN}.e1K ${RUN}.FullLinker.NonChimeric.paired.UU.nr.bam 2>&1 | tee -a ${LOGFILE}
echo `date` ENDED ${RUN} cpu clustering.. | tee -a ${LOGFILE}
echo `date` removing transient file .. | tee -a ${LOGFILE}
rm ${RUN}.e1K.cpu.cluster 2>> ${LOGFILE}
After decompressing the collection of clustered PETs which both tags mapped to the same chromosome (i.e. potential cis-interactions), ChiaSig is invoked to assess the interaction statistical significance.
# generate clustered tags for ChiaSig
echo `date` Decompressing cis interaction clusters for ChiaSig .. | tee -a ${LOGFILE}
gunzip ${RUN}.e1K.clusters.cis.chiasig.gz
echo `date` STARTED ChiaSig on cis interactions .. | tee -a ${LOGFILE}
./ChiaSig -c 2 -p -t ${NTHREAD} ${RUN}.e1K.clusters.cis.chiasig >${RUN}.e1K.clusters.cis.interaction 2>${RUN}.e1K.clusters.cis.interaction.err
echo `date` ENDED ChiaSig on cis interactions .. | tee -a ${LOGFILE}
See ChiaSig Wiki to familiarize yourself with its workflow.