TAGET user manual

TAGET is a computational toolkit that provides a wide spectrum of tools for analyzing full-length transcriptome data. Building on highly precise transcript alignment and junction prediction, TAGET enables accurate novel isoform detection, gene fusion detection, and expression quantification.

Environment dependencies

  • At least one aligner: HISAT2, Minimap2, or GMAP
  • samtools
  • Python 3
  • R >= 3.3
  • Linux (CentOS)
  • Python packages: rpy2, pandas, numpy
  • R packages: stringr, optparse, DEGseq

FAST RUN

python TransAnnot.py -f [fasta] -g [genome fasta] -o [output directory] -a [annot gtf] -p [process] --use_minimap2 [1] --use_hisat2 [hisat2 index]

Alternatively, pass all settings through a config file:

python TransAnnot.py -c TransAnnot.Config
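
When many samples are processed, the command line shown above can be assembled in a script. The following is a minimal sketch, not part of TAGET: the function name and the file names are hypothetical, only the flags are copied from the command above, and the command is printed rather than executed.

```python
import shlex


def build_transannot_command(fasta, genome_fa, output_dir, annot_gtf,
                             processes=8, use_minimap2=True, hisat2_index=None):
    """Assemble a TransAnnot.py argument list from the flags in the README."""
    cmd = ["python", "TransAnnot.py",
           "-f", fasta, "-g", genome_fa, "-o", output_dir,
           "-a", annot_gtf, "-p", str(processes)]
    if use_minimap2:
        cmd += ["--use_minimap2", "1"]
    if hisat2_index is not None:
        # The README passes the HISAT2 index as the value of --use_hisat2.
        cmd += ["--use_hisat2", hisat2_index]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_transannot_command("sample.fa", "hg38.fa", "out",
                               "hg38.ensembl.gtf", hisat2_index="hg38_index")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the paths point at real files.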

Running time

A run takes less than 1 hour with 8 cores on a Linux server.

Running the software

The config file contains the path of each required program and the index file of the reference genome.

  1. On first use, set the following parameters:

    • the path of HISAT2/Minimap2
    • the index file of the reference genome
    • the reference genome (FASTA), the transcript annotation file (GTF, Ensembl by default), and the number of processes
  2. After setting these base parameters, set the FASTA file of the full-length transcripts and the output directory, or combine the config file with command-line options: -c config -f [fasta] -o [output]

  3. Read, transcript, and gene expression can be calculated with the --tpm parameter.

Running results

The output includes {sample_id}.annot.stat, with the following columns:

  • ID: read ID
  • Classification: classification of the read
  • Subtype: subtype of the read
  • Gene: gene annotation, or the genomic region [chr1:100000-100500]
  • Transcript: transcript annotation
  • Chrom: chromosome
  • Strand: strand
  • Seq_length: read length
  • Seq_exon: number of exons in the read
  • Ref_length: length of the annotated transcript
  • Ref_exon_num: number of exons in the annotated transcript
  • diff_to_gene_start: difference between the 5' end of the read and the 5' end of the annotated gene on the reference genome
  • diff_to_gene_end: difference between the 3' end of the read and the 3' end of the annotated gene on the reference genome
  • diff_to_transcript_start: difference between the 5' end of the read and the 5' end of the annotated transcript on the reference genome
  • diff_to_transcript_end: difference between the 3' end of the read and the 3' end of the annotated transcript on the reference genome
  • exon_miss_to_transcript_start: number of exons missing at the 5' end of the read relative to the annotated transcript
  • exon_miss_to_transcript_end: number of exons missing at the 3' end of the read relative to the annotated transcript
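
A quick way to summarize such a file is to count reads per Classification value. The sketch below assumes the file is tab-separated with a header row that uses the column names listed above; neither is confirmed here, so check a real {sample_id}.annot.stat first. The example rows are made up.

```python
import csv
import io
from collections import Counter


def count_classifications(handle):
    """Count reads per Classification value in an annot.stat-like table.

    Assumes tab-separated text with a header row (an assumption, see above).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Classification"] for row in reader)


# Made-up rows standing in for a real {sample_id}.annot.stat file.
example = io.StringIO(
    "ID\tClassification\tSubtype\tGene\n"
    "read1\tFSM\t-\tGENE_A\n"
    "read2\tISM\t-\tGENE_A\n"
    "read3\tFSM\t-\tGENE_B\n"
)
counts = count_classifications(example)
print(counts.most_common())
```

For a real file, replace the StringIO object with open(path, newline="").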

TransAnnot.Config

  • FASTA: [path], input file; full-length transcripts in FASTA format
  • OUTPUT_DIR: [path], the output directory
  • GENOME_FA: [path], the FASTA file of the reference genome (e.g., hg38.fa)
  • GTF_ANNOTATION: [path], the gene annotation file (GTF format by default)
  • PROCESS: [int], the number of processes
  • SAMPLE_UNIQUE_NAME: [string], the prefix of each output file
  • PYTHON: [path], the path of python
  • TAGET_DIR: [path], the path of TAGET
  • SAMTOOLS: [path], the path of samtools
  • USE_HISAT2: [int], whether to use HISAT2; 1 means use, 0 means do not use
  • HISAT2: [path], the path of HISAT2
  • HISAT2_INDEX: [path], the path of the HISAT2 index, generated by hisat2-build
  • USE_MINIMAP2: [int], whether to use Minimap2; 1 means use, 0 means do not use
  • MINIMAP2: [path], the path of Minimap2
  • USE_GMAP: [int], whether to use GMAP; 1 means use, 0 means do not use
  • GMAP: [path], the path of GMAP
  • GMAP_INDEX: [path], the path of the GMAP index, generated by gmap_build
  • TPM_LIST: [path], the isoform expression file
  • READ_LENGTH: [int], the read length used by HISAT2, default 100
  • READ_OVERLAP: [int], the read overlap used by HISAT2, default 80
  • MIN_READ_LENGTH: [int], the minimum read length, default 30
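
Before launching a run, the settings can be sanity-checked. The sketch below works on a plain Python dict so that it makes no assumption about the on-disk syntax of TransAnnot.Config. The rules it applies (at least one aligner enabled, each enabled aligner has its path and index set, the four input/output keys are present) are an interpretation of the list above, not a check that TAGET itself is known to perform.

```python
def check_settings(settings):
    """Return a list of problems found in TransAnnot.Config-style settings."""
    problems = []
    # Keys that each aligner switch is assumed to require when set to 1.
    aligners = {
        "USE_HISAT2": ["HISAT2", "HISAT2_INDEX"],
        "USE_MINIMAP2": ["MINIMAP2"],
        "USE_GMAP": ["GMAP", "GMAP_INDEX"],
    }
    enabled = [key for key in aligners if int(settings.get(key, 0)) == 1]
    if not enabled:
        problems.append("enable at least one of USE_HISAT2, USE_MINIMAP2, USE_GMAP")
    for switch in enabled:
        for key in aligners[switch]:
            if not settings.get(key):
                problems.append(f"{switch}=1 but {key} is not set")
    for key in ("FASTA", "OUTPUT_DIR", "GENOME_FA", "GTF_ANNOTATION"):
        if not settings.get(key):
            problems.append(f"{key} is not set")
    return problems


# Hypothetical values, for illustration only.
settings = {
    "FASTA": "sample.fa", "OUTPUT_DIR": "out",
    "GENOME_FA": "hg38.fa", "GTF_ANNOTATION": "hg38.ensembl.gtf",
    "USE_MINIMAP2": 1, "MINIMAP2": "/usr/bin/minimap2",
    "USE_HISAT2": 1, "HISAT2": "/usr/bin/hisat2",
}
problems = check_settings(settings)
print(problems)
```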

TransAnnotMerge

TransAnnotMerge generates an expression matrix across multiple samples.

FAST RUN

Extract isoform expression from the FASTA file:

python fa2exp.py -f [fa] -i [prefix] -o [exp] -p [taget output directory]

  • -f: full-length transcripts in FASTA format
  • -o: output directory
  • -p: prefix of the {sample_unique_name}.anno files

python script.py input.config

This step requires the exon.gtf file, which can be extracted with unzip exon.gtf.zip

Usage of TransAnnotMerge

python TranAnnotMerge -c MergeConfig -o outputdir -m [TPM/FLC/None]

#sample stat bed db
------- ---- --- --
  • -o: the output directory
  • -m: the method used to report gene and transcript expression: TPM, FLC (full-length count), or None (no expression matrix is generated)

Running TransAnnotMerge

  1. Extract isoform expression from the FASTA file: python fa2exp.py -f [fa] -o [exp]
  2. Run TranAnnotMerge: python TranAnnotMerge -c MergeConfig -o outputdir -m TPM

TransAnnotMerge result file

  • {sample_id}.reads.exp: the read expression of each sample
  • {sample_id}.transcript.exp: the transcript expression of each sample
  • {sample_id}.gene.exp: the gene expression of each sample
  • gene.exp: the gene expression matrix across samples
  • transcript.exp: the transcript expression matrix across samples
  • merge.db.pickle: used for viewing transcripts
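
To illustrate what an expression matrix across samples means, the sketch below combines per-sample values into one table with a row per feature and a column per sample. It is not TAGET's implementation: the function name, the sample and transcript names, and the choice to fill missing features with 0.0 are all assumptions made for the example.

```python
def merge_expression(per_sample):
    """Combine {sample: {feature: value}} into (header, rows).

    Features absent from a sample are filled with 0.0 (an assumption).
    """
    samples = sorted(per_sample)
    features = sorted({f for values in per_sample.values() for f in values})
    header = ["feature"] + samples
    rows = [[f] + [per_sample[s].get(f, 0.0) for s in samples] for f in features]
    return header, rows


# Made-up values for two hypothetical samples.
per_sample = {
    "tumor": {"tx1": 12.5, "tx2": 3.0},
    "normal": {"tx1": 8.0, "tx3": 1.5},
}
header, rows = merge_expression(per_sample)
for line in [header] + rows:
    print("\t".join(str(x) for x in line))
```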

DIU analysis

python expression_V1.py -t {sample_id}.transcript.exp -g {sample_id}.gene.exp -o {prefix}

  • -t: transcript expression of the tumor and normal samples
  • -g: gene expression of the tumor and normal samples
  • -o: prefix of the output files
  • -r: threshold for filtering out low-expression transcripts, default 0.05
  • -p: threshold for filtering out low-expression genes, default 50
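
The idea of filtering on both gene and transcript level can be sketched as follows. This is an illustration only and not the logic of expression_V1.py: it assumes that the gene threshold is a minimum summed expression and that the transcript threshold is a minimum fraction of the gene's expression, which the option list above does not confirm. All names and values are hypothetical.

```python
def isoform_usage(transcript_expr, gene_of, min_gene_expr=50, min_fraction=0.05):
    """Return {transcript: fraction of its gene's expression} after filtering.

    Genes below min_gene_expr and transcripts below min_fraction are dropped.
    Both thresholds are assumed interpretations, not TAGET's documented rules.
    """
    gene_total = {}
    for tx, value in transcript_expr.items():
        gene = gene_of[tx]
        gene_total[gene] = gene_total.get(gene, 0.0) + value

    usage = {}
    for tx, value in transcript_expr.items():
        total = gene_total[gene_of[tx]]
        if total <= 0 or total < min_gene_expr:
            continue  # gene too lowly expressed
        fraction = value / total
        if fraction >= min_fraction:
            usage[tx] = fraction
    return usage


# Made-up expression values.
transcript_expr = {"tx1": 88.0, "tx2": 10.0, "tx3": 2.0, "tx4": 20.0}
gene_of = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneA", "tx4": "geneB"}
usage = isoform_usage(transcript_expr, gene_of)
print(usage)
```

Comparing such fractions between tumor and normal samples is the general notion behind isoform usage analysis; the statistical test is outside this sketch.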

Classification of isoforms annotated by TAGET

Classification of transcripts

  • FSM: full splice site match
  • ISM: incomplete splice site match
  • NIC: novel in catalog
  • NNC: novel not in catalog
  • GENIC: genic
  • INTERGENIC: intergenic
  • FUSION: fusion
  • UNKNOWN: unknown
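
The first four categories can be illustrated with a small junction-based classifier. The sketch follows the commonly used SQANTI-style reading of these labels (FSM: all junctions match a reference transcript; ISM: the junctions are a consecutive part of a reference transcript; NIC: only known splice sites, in a new combination; NNC: at least one novel splice site). Whether TAGET applies exactly these rules is an assumption, and the function name and coordinates are made up.

```python
def classify(read_junctions, reference_transcripts):
    """Classify a multi-exon read by its splice junctions.

    read_junctions: list of (intron_start, intron_end) tuples in genomic order.
    reference_transcripts: list of such lists, one per annotated transcript.
    Returns "FSM", "ISM", "NIC", "NNC", or None for reads without junctions.
    """
    read = list(read_junctions)
    if not read:
        return None  # mono-exonic reads are outside this sketch

    refs = [list(ref) for ref in reference_transcripts]
    known_starts = {start for ref in refs for start, _ in ref}
    known_ends = {end for ref in refs for _, end in ref}

    if any(read == ref for ref in refs):
        return "FSM"
    n = len(read)
    for ref in refs:
        # A consecutive sub-chain of a reference junction chain.
        if any(ref[i:i + n] == read for i in range(len(ref) - n + 1)):
            return "ISM"
    if all(s in known_starts and e in known_ends for s, e in read):
        return "NIC"
    return "NNC"


# Made-up annotation with two transcripts of one gene.
reference = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 250), (300, 400)],
]
print(classify([(100, 200), (300, 400), (500, 600)], reference))
print(classify([(300, 400), (500, 600)], reference))
print(classify([(100, 250), (300, 400), (500, 600)], reference))
print(classify([(100, 200), (310, 400)], reference))
```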

Classification of exons

  • KE: known exon
  • LEKE: left end known exon
  • REKE: right end known exon
  • NEKSLE: novel exon with a known splice site in the left end exon, whose unique region overlaps at least two known exons
  • NEKSRE: novel exon with a known splice site in the right end exon, whose unique region overlaps at least two known exons
  • IE: intron retention: two known splice sites from sequential exons of the same transcript
  • NEDT: novel exon with two known splice sites from different transcripts
  • NELS: novel exon with a novel left splice site
  • NERS: novel exon with a novel right splice site
  • LEE: left exon extension: a novel splice site at the left end of the exon, which is longer than any exon overlapping it
  • REE: right exon extension: a novel splice site at the right end of the exon, which is longer than any exon overlapping it
  • NEDS: novel exon with two novel splice sites that overlaps at least one known exon
  • NEIG: novel exon inner-gene: novel exon inside the gene, without any overlap with a known exon
  • NEOG: novel exon inter-gene: novel exon outside the gene
  • NELE: novel exon with a novel splice site in the far left exon
  • NERE: novel exon with a novel splice site in the far right exon
  • MDNS: mono-exon with two novel splice sites

TAGET gene fusion detect and filter

Rscript TAGET_fusion_2-3_ajust.r -j Jin_fusion_select.py -e STAT_select.py -c chuli.py -l ${i}.fa.minimap2.bed -s ${i}.fa.hisat2.bed -a ${i}.fa.anno.tmp.stat -t hg38.gtf -f ${i}.fa -n ${i} -o ./output

  • -l ${i}.fa.minimap2.bed: the BED file of Minimap2 alignments, generated by TAGET
  • -s ${i}.fa.hisat2.bed: the BED file of HISAT2 alignments, generated by TAGET
  • -f ${i}.fa: the CCS reads from the PacBio platform
  • -n ${i}: the prefix of the generated file names

An example of TAGET

Download data.7z, demo.7z, and script.7z to run the demo. The run takes less than 1 hour with 8 cores on a Linux server. See example.readme for details. The reference genome can be downloaded from https://disk.pku.edu.cn:443/link/1F62976F65C4EA81C4C06A05E245049D

Reference genome

Here we used hg38 to annotate transcripts; the human reference hg38.fa and an Ensembl GTF file are required. HISAT2 and Minimap2 each need an index built from this reference. The pickle file of the GTF file can be generated with gtf_db_make.py:

python gtf_db_make.py hg38.ensembl.gtf hg38.ensembl.v20200306.1.pickle

The reference genome can also be downloaded from https://disk.pku.edu.cn:443/link/5E0152D82C71B992690CFA9D7A3B5CF8 or https://zenodo.org/records/10091914

Demos

Dependencies

  • HISAT2 v2.2.1
  • Minimap2 v2.24
  • samtools v1.19
  • Python 3.9
  • R 3.5
  • Linux CentOS 7
  • Python packages: rpy2 v3.3.3, pandas v1.2.3, numpy v1.22.0
  • R packages: stringr v1.5.0, optparse v1.7.3, DEGseq v1.12

1. Fast run

python TransAnnot.py -c 759133C.Config
python TransAnnot.py -c 759133N.Config

Running time: 72 minutes

Outputs in the 759133C directory:

  • 759133C.minimap2.bed
  • 759133C.hisat2.bed
  • 759133C.annot.bed
  • 759133C.annot.stat
  • 759133C.annot.db.pickle
  • 759133C.annot.cluster.gene
  • 759133C.annot.cluster.transcript
  • 759133C.annot.cluster.reads
  • 759133C.annot.junction
  • 759133C.annot.multiAnno
  • 759133C.anno.tmp.stat

Outputs in the 759133N directory:

  • 759133N.minimap2.bed
  • 759133N.hisat2.bed
  • 759133N.annot.bed
  • 759133N.annot.stat
  • 759133N.annot.db.pickle
  • 759133N.annot.cluster.gene
  • 759133N.annot.cluster.transcript
  • 759133N.annot.cluster.reads
  • 759133N.annot.junction
  • 759133N.annot.multiAnno
  • 759133N.anno.tmp.stat

2. TransAnnotMerge

python fa2exp.py -f 759133C.fa -i 759133C -o 759133C -p ./expression
python fa2exp.py -f 759133N.fa -i 759133N -o 759133N -p ./expression
python script.py input.config

Running time: 35 minutes

Outputs:

  • 759133.reads.exp
  • 759133.transcript.exp

3. DIU analysis

python expression_V1.py -t 759133.transcript.exp -g 759133.gene.exp -o 759133

Running time: 2 minutes

Outputs:

  • 759133_DIU.txt

4. Gene fusion

Rscript TAGET_fusion_2-3_ajust.r -j Jin_fusion_select.py -e STAT_select.py -c chuli.py -l 759133C.minimap2.bed -s 759133C.hisat2.bed -a 759133C.fa.anno.tmp.stat -t hg38.gtf -f 759133C.fa -n 759133C -o ./output

Running time: 18 minutes

Outputs:

  • 759133C.fusion
