Skip to content
Chee-Hong Wong edited this page Sep 25, 2018 · 3 revisions

Welcome to the CPU wiki!

What is CPU?

ChIA-PET Utilities is a collection of efficient specialized programs for processing ChIA-PET data from raw reads to interactions. Essentially, CPU addresses the following tasks:

  1. trimming sequencing adaptors from paired reads
  2. identifying the linker(s) to classify the PETs as intra-/inter-molecular
  3. genome mapping using BWA aln and mem algorithm according to the tag length
  4. de-duplicating PETs arising from clonal PCR amplification
  5. clustering of PETs into interactions
  6. determining the statistical significance of interactions via ChiaSigScaled

If you will like to get a more detail overview (1-5), check out my presentation "ChIA-PET Utilities -- from Reads to Interactions" delivered at the ChIA-PET Workshop 2017.

Download

CPU release

ChiaSigScaled release

BWA release to construct the genome index for read mapping.

Installation

We have run CPU successfully on Linux (Debian, SUSE, CentOS, Ubuntu) and Mac (El Capitan v10.11.6) system .

C compiler such as GNU C Compiler (gcc).

To install CPU:

tar -zxvf CPU-0.0.1a-r1.tar.gz

# create static compression library
cd CPU-0.0.1a-r1/source/zlib-1.2.8
./configure
make clean
make
cd ../../..
# takes less than a minute

cd CPU-0.0.1a-r1/source
make clean
ln -s zlib-1.2.8/libz.a .
make
./cpu
# takes less than a minute

Quick start

Demo data

Due to github constraint on file size, the data in the demo folder is for one to familiarize with / exercise the workflow rather than to extract biological insights. Once you have tried this workflow, you could then work with ChIA-PET data downloaded from public repository.

Mouse BWA index construction

You can re-use your BWA index for mm10 build with BWA v0.7.10+. Otherwise, you may build it with the provided script as follow:

# CREDIT: Script adapted from Bowtie distribution.
# build genome index
./buildmm10.sh 2>&1 | tee buildmm10.log

ChIA-PET data processing

To process the demo data "EED_chiapet_rep6.1M.R1.fastq.gz" and "EED_chiapet_rep6.1M.R2.fastq.gz", you may use the provided script as follow:

# process EED_chiapet_rep6.1M.R{1,2}.fastq.gz using 16 threads
./example.sh EED_chiapet_rep6.1M 16

Let's step throught the script "example.sh".

export RUN=${1}
export NTHREAD=${2}
export LOGFILE=${RUN}.cpu.log

This set up the variables based on the parameters passed to the script.

Step 1 & 2 : adapter trimming and linker(s) identification

cpu tag trims sequencing adaptors from paired reads. Command tag will additionally identify the linker(s) to classify the PETs as intra-/inter-molecular.

# split tag in read
echo `date` STARTED ${RUN} cpu tags .. | tee -a ${LOGFILE}
./cpu tag -W -T 18 -t ${NTHREAD} -O ${RUN} ${RUN}.R1.fastq.gz ${RUN}.R2.fastq.gz 2>> ${LOGFILE}
echo `date` ENDED ${RUN} cpu tags .. | tee -a ${LOGFILE}
Output Description
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.fastq
(file of interest)
intra-molecular PET with both tags; interaction signal
EED_chiapet_rep6.1M.FullLinker.NonChimeric.single.fastq intra-molecular PET with a single tag; the other tag is too short or unavailable
EED_chiapet_rep6.1M.fulllinker.chimeric.paired.fastq inter-molecular PET with both tags; consider random ligation
EED_chiapet_rep6.1M.fulllinker.chimeric.single.fastq inter-molecular PET with a single tag; consider random ligation
EED_chiapet_rep6.1M.none.fastq PET with no linker identified; can't tell if it is inter-/intra-molecular
EED_chiapet_rep6.1M.halflinker.fastq only half linker has been identified
EED_chiapet_rep6.1M.tied.fastq more than 1 full linkers identified with the same score
EED_chiapet_rep6.1M.conflict.fastq the identified linkers are discordant in its inter-/intra-molecular classification.

We compressed the identified informative tags to save disk space.

# compress splitted tags
echo `date` 'STARTED Compressing FL NC paired tags ..' | tee -a ${LOGFILE}
gzip -f ${RUN}.FullLinker.NonChimeric.paired.fastq 2>>${LOGFILE}
echo `date` 'ENDED Compressing FL NC paired tags ..' | tee -a ${LOGFILE}

Step 3 : Independent genome mapping of pair-tags separately

cpu memaln uses BWA aln and mem algorithm according to the tag length.

# mapping
echo `date` STARTED Mapping FL NC paired tags .. | tee -a ${LOGFILE}
./cpu memaln -T 15 -t ${NTHREAD} mm10 ${RUN}.FullLinker.NonChimeric.paired.fastq.gz 2>> ${LOGFILE} 1>${RUN}.FullLinker.NonChimeric.paired.sam
echo `date` ENDED Mapping FL NC paired tags .. | tee -a ${LOGFILE}

Again, we compress the alignment output (.sam) to save disk space.

echo `date` 'STARTED Compressing FL NC memaln sam file ..' | tee -a ${LOGFILE}
gzip ${RUN}.FullLinker.NonChimeric.paired.sam 2>&1  | tee -a ${LOGFILE}
echo `date` 'ENDED Compressing FL NC memaln sam file ..' | tee -a ${LOGFILE}

cpu pair considers the mapping results of the pair-tags that have been mapped independently and group the PETs broadly into (1) both tags originate from a unique genmoic location, (2) one of the tags from the pair-tags originates from a unique genomic location, (3) both tags from the pair-tags mapped to multiple genomic locations, and (4) one or both tags from the pair-tags do not map to the reference genome.

# pairing
echo `date` STARTED Pairing FL NC paired tags .. | tee -a ${LOGFILE}
./cpu pair -S -t ${NTHREAD} ${RUN}.FullLinker.NonChimeric.paired.sam.gz 1>${RUN}.FullLinker.NonChimeric.paired.stat.xls 2>> ${LOGFILE}
echo `date` ENDED Pairing FL NC paired tags .. | tee -a ${LOGFILE}
Output Description
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UU.bam
(file of interest)
contains PETs which both tags mapped to a single/unique genomic location
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UxxU.bam contains PETs which only one of the pair-tags mapped to a single/unique genomic location
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.xx.bam contains PETs which none of the pair-tags mapped to a single/unique genomic location
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.nn.bam contains PETs which none of the pair-tags mapped to the reference genome
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UU.failed.bam contains PETs which both tags mapped to a single/unique genomic location BUT which one or both tags whose best mapping result does not distinguish very well from the second best mapping result(s)
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UU.discordant.bam contains PETs which both tags mapped to a single/unique genomic location BUT which one tag extractable from both Read/1 and Read/2 disagree on the uniquely mapped genomic location

Step 4 : PETs de-duplication

cpu dedup removes potential duplicates from PCR. cpu applies the standard pair-end reads deduplication consideration of 5' mapping coordinate of Read/1 and Read/2 on the pair-tags. Additionally, cpu consider the identified linker(s) and also the 3' end of tag if linker has been identified adjacent to the tag.

# deduplication
echo `date` STARATED De-duplicating FL NC paired tags UU .. | tee -a ${LOGFILE}
./cpu dedup -g -t ${NTHREAD} ${RUN}.FullLinker.NonChimeric.paired.UU.bam 1>${RUN}.FullLinker.NonChimeric.paired.UU.dedup.lc 2>> ${LOGFILE}
echo `date` ENDED De-duplicating FL NC paired tags UU .. | tee -a ${LOGFILE}

echo `date` removing transient file .. | tee -a ${LOGFILE}
rm ${RUN}.FullLinker.NonChimeric.paired.UU.dedup 2>> ${LOGFILE}
Output Description
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UU.nr.bam
(file of interest)
contains PETs which are considered non-redundant
EED_chiapet_rep6.1M.FullLinker.NonChimeric.paired.UU.dup.bam contains PETs which are consider PCR duplicates

Step 5 : PETs clustering

cpu cluster collates the non-redundant uniquely-mapped pair-end tags to produce cis-interactions and trans-interactions both merging pair-end tags whose both anchors overlap or within x bases on the reference genome coordiates. (1000 in this example; from -5 and -3 options we have 980-(-20)=1000.)

# cluster tags
echo `date` STARTED ${RUN} cpu clustering.. | tee -a ${LOGFILE}
./cpu cluster -g -M -B 1000 -5 5,-20 -3 5,980 -t 1 -O ${RUN}.e1K ${RUN}.FullLinker.NonChimeric.paired.UU.nr.bam 2>&1 | tee -a ${LOGFILE}
echo `date` ENDED ${RUN} cpu clustering.. | tee -a ${LOGFILE}

echo `date` removing transient file .. | tee -a ${LOGFILE}
rm ${RUN}.e1K.cpu.cluster 2>> ${LOGFILE}

Step 6 : Interaction's statistical significance

After decompressing the collection of clustered PETs which both tags mapped to the same chromosome (i.e. potential cis-interactions), ChiaSig is invoked to assess the interaction statistical significance.

# generate clustered tags for ChiaSig
echo `date` Decompressing cis interaction clusters for ChiaSig .. | tee -a ${LOGFILE}
gunzip ${RUN}.e1K.clusters.cis.chiasig.gz

echo `date` STARTED ChiaSig on cis interactions .. | tee -a ${LOGFILE}
./ChiaSig -c 2 -p -t ${NTHREAD} ${RUN}.e1K.clusters.cis.chiasig >${RUN}.e1K.clusters.cis.interaction 2>${RUN}.e1K.clusters.cis.interaction.err
echo `date` ENDED ChiaSig on cis interactions .. | tee -a ${LOGFILE}

See ChiaSig Wiki to familiarize yourself with its workflow.