DeeReCT-TSS: A novel meta-learning-based method annotates TSS in multiple cell types based on DNA sequences and RNA-seq data

Overview

This repository contains the implementation of DeeReCT-TSS from

Juexiao Zhou, Bin Zhang, et al. "Annotating TSSs in Multiple Cell Types Based on DNA Sequence and RNA-seq Data via DeeReCT-TSS"

If you use our work in your research, please consider citing our paper:

@article{zhou2022annotating,
  title={Annotating TSSs in Multiple Cell Types Based on DNA Sequence and RNA-seq Data via DeeReCT-TSS},
  author={Zhou, Juexiao and Zhang, Bin and Li, Haoyang and Zhou, Longxi and Li, Zhongxiao and Long, Yongkang and Han, Wenkai and Wang, Mengran and Cui, Huanhuan and Li, Jingjing and others},
  journal={Genomics, Proteomics \& Bioinformatics},
  year={2022},
  publisher={Elsevier}
}

Prerequisites

The code is tested with the following dependencies:

python=3.6
biopython=1.78
bedtools=2.30.0
cudatoolkit=10.1.243
cudnn=7.6.5
numpy=1.19.2
scipy=1.5.2
pandas=1.1.3
scikit-learn=0.22.1
tensorflow-gpu=1.14.0
matplotlib=3.3.4
seaborn=0.11.1
samtools

The code is not guaranteed to work if different versions are used.
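
To compare an environment against the tested versions, a minimal sketch along the following lines may help. It assumes each Python package exposes a `__version__` attribute, which is common but not guaranteed; `EXPECTED` and `find_mismatches` are illustrative names and not part of this repository. bedtools, samtools, cudatoolkit and cudnn are not Python modules and are not covered.

```python
import importlib

# Import name -> version from the tested dependency list.
EXPECTED = {
    "Bio": "1.78",           # biopython
    "numpy": "1.19.2",
    "scipy": "1.5.2",
    "pandas": "1.1.3",
    "sklearn": "0.22.1",     # scikit-learn
    "tensorflow": "1.14.0",  # tensorflow-gpu
    "matplotlib": "3.3.4",
    "seaborn": "0.11.1",
}


def find_mismatches(expected):
    """Return {module: (expected, found)} for modules that differ or are missing."""
    mismatches = {}
    for module_name, wanted in expected.items():
        try:
            module = importlib.import_module(module_name)
            found = getattr(module, "__version__", None)
        except ImportError:
            found = None  # module is not installed
        if found != wanted:
            mismatches[module_name] = (wanted, found)
    return mismatches
```

Calling `find_mismatches(EXPECTED)` returns an empty dictionary when every listed package matches, and otherwise maps each differing module to the expected and installed versions.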

To analyze bam files of around 10 GB, each thread requires 4-5 GB of memory when the job is split into 25 threads (roughly 100-125 GB in total).

Genome-wide TSS Scanning

bash ./run.sh \
      path/to/Aligned.sortedByCoord.out.bam \
      path/to/gencode.v38.pcg.extups5k.bed \
      path/to/model.npz \
      path/to/reference_genome.fa \
      path/to/output \
      0/1/2 \
      25

The arguments are, in order:

1, the aligned RNA-Seq bam file
2, the regions for scanning in "BED" format (an example file covering all protein coding genes is provided under the folder /ref)
3, the pre-trained model (pre-trained models are provided under the folder /model)
4, the reference genome sequence in "FASTA" format (an example file is provided under the folder /ref)
5, the output directory
6, ifstranded, given as 0, 1 or 2
7, the number of threads

Example:

bash ./run.sh \
      ../DeeReCT-TSS_release/data/TCGA-AA-3517-11A-01R-A32Z-07/rnaseq/Aligned.sortedByCoord.out.bam \
      ../DeeReCT-TSS_release/ref/gencode.v38.pcg.extups5k.bed \
      ../DeeReCT-TSS_release/model/colon_model/model_best.npz \
      ../DeeReCT-TSS_release/ref/hg38/hg38.fa \
      ./test_out/ \
      0 \
      25
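
When launching many samples, it can be convenient to assemble the call from a script. The following is a minimal sketch and not part of this repository: `build_run_command` is a hypothetical helper that only mirrors the positional order shown above and rejects ifstranded values other than 0, 1 or 2.

```python
import subprocess


def build_run_command(bam, regions_bed, model_npz, genome_fa, out_dir,
                      stranded=0, threads=25, script="./run.sh"):
    """Return the argument list for run.sh in the positional order shown above."""
    if stranded not in (0, 1, 2):
        raise ValueError("stranded must be 0, 1 or 2")
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return ["bash", script, bam, regions_bed, model_npz, genome_fa,
            out_dir, str(stranded), str(threads)]


cmd = build_run_command(
    "path/to/Aligned.sortedByCoord.out.bam",
    "path/to/gencode.v38.pcg.extups5k.bed",
    "path/to/model.npz",
    "path/to/reference_genome.fa",
    "path/to/output",
)
print(" ".join(cmd))
# To actually launch the scan, uncomment the next line:
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when paths contain spaces.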

Reference preparation

The reference genome file can be downloaded from "https://www.gencodegenes.org" or from other databases, e.g. Ensembl, UCSC and NCBI.

The file marking the regions for scanning should be in "BED" format. A simple way to generate the file for scanning all protein coding genes is shown below:

1, Download the gene annotation (gtf file) from "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz"

2, Select the rows containing gene information, keep only the protein coding genes, extend each gene by 5 kb upstream of its start, and convert to "BED" format

zcat gencode.v38.annotation.gtf.gz | awk '$3 == "gene"' | grep "protein_coding" | awk 'BEGIN{OFS="\t"} {if($7 == "+") {s=$4-5000; if(s<0) s=0; print $1,s,$5,$10":"$14,$12,$7} else print $1,$4,$5+5000,$10":"$14,$12,$7}' | sed 's/[";]//g' > gencode.v38.pcg.extups5k.bed
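
To make the column mapping concrete, here is a minimal Python sketch of the conversion described in step 2 for a single GTF gene line: keep protein coding genes, extend 5 kb upstream of the gene start (before the start on the "+" strand, after the end on the "-" strand), and emit six columns. It assumes GENCODE-style attributes named `gene_id`, `gene_type` and `gene_name`. Clamping the start at 0 is a choice made here to avoid negative coordinates. The function name and the sample line are illustrative only.

```python
def gtf_gene_to_bed(line, extension=5000):
    """Convert one GTF 'gene' line to a 6-column BED-like row, or None if skipped."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "gene":
        return None
    chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]

    # Parse 'key "value";' pairs from the attribute column.
    attrs = {}
    for item in fields[8].strip().split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attrs[key] = value.strip('"')
    if attrs.get("gene_type") != "protein_coding":
        return None

    # Upstream is before the start on '+' and after the end on '-'.
    if strand == "+":
        start = max(0, start - extension)  # clamp so the start is never negative
    else:
        end = end + extension
    name = "{}:{}".format(attrs.get("gene_id", ""), attrs.get("gene_name", ""))
    return [chrom, str(start), str(end), name, attrs["gene_type"], strand]


# Illustrative gene line, not taken from the annotation file.
example = ('chr1\tHAVANA\tgene\t65419\t71585\t.\t+\t.\t'
           'gene_id "ENSG00000186092.7"; gene_type "protein_coding"; gene_name "OR4F5";')
print("\t".join(gtf_gene_to_bed(example)))
```

For the sample line, the start moves from 65419 to 60419 while the end is unchanged, because the gene is on the "+" strand.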

Output

There will be two output files under the output directory: "combined.raw.prediction" and "combined.predicted.cluster"

The former holds the raw prediction scores from the deep learning model in an extended bedGraph format. Columns 1-3 give the genomic coordinates, column 4 is the prediction score, and column 5 is the strand.
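
As an illustration of this five-column layout, the sketch below reads such records and reports the highest-scoring position. It assumes tab-separated columns with no header line, which is not stated above; the record type, function name and sample rows are made up for the example.

```python
import io
from collections import namedtuple

RawPrediction = namedtuple("RawPrediction", "chrom start end score strand")


def read_raw_predictions(handle):
    """Yield RawPrediction records from an open 5-column text handle."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chrom, start, end, score, strand = line.split("\t")[:5]
        yield RawPrediction(chrom, int(start), int(end), float(score), strand)


# Made-up rows standing in for "combined.raw.prediction".
sample = io.StringIO(
    "chr1\t1000\t1001\t0.12\t+\n"
    "chr1\t1001\t1002\t0.87\t+\n"
    "chr1\t1002\t1003\t0.40\t+\n"
)
best = max(read_raw_predictions(sample), key=lambda rec: rec.score)
print(best.chrom, best.start, best.end, best.score, best.strand)
```

For a real run, the `io.StringIO` object would be replaced by `open("path/to/output/combined.raw.prediction")`.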

The latter holds the final predicted TSSs, obtained by clustering the raw prediction scores, in an extended BED format. Columns 1-3 give the TSS locus, column 4 is the gene with which the predicted TSS is associated, column 5 is the clustered prediction score, column 6 is the strand, and column 7 is an empirical P value.
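
As an illustration of the seven-column clustered file, the sketch below keeps only the predicted TSSs whose empirical P value is at or below a cutoff. The tab-separated layout and the 0.05 threshold are assumptions chosen for the example, not recommendations from the authors, and the sample rows are made up.

```python
import io


def filter_clusters(handle, max_p=0.05):
    """Return 7-column rows (as dicts) whose empirical P value is <= max_p."""
    columns = ["chrom", "start", "end", "gene", "score", "strand", "p_value"]
    kept = []
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 7:
            continue  # skip blank or malformed lines
        row = dict(zip(columns, parts[:7]))
        row["start"], row["end"] = int(row["start"]), int(row["end"])
        row["score"], row["p_value"] = float(row["score"]), float(row["p_value"])
        if row["p_value"] <= max_p:
            kept.append(row)
    return kept


# Made-up rows standing in for "combined.predicted.cluster".
sample = io.StringIO(
    "chr1\t1000\t1010\tGENE_A\t0.91\t+\t0.001\n"
    "chr1\t5000\t5008\tGENE_A\t0.35\t+\t0.20\n"
)
for row in filter_clusters(sample):
    print(row["gene"], row["start"], row["end"], row["p_value"])
```

Only the first sample row passes the cutoff, so a single line is printed.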

Acknowledgement

This project is supported by KAUST and SUSTech.
