TAGET is a computational toolkit that provides a wide spectrum of tools for analyzing full-length transcriptome data. Based on its highly precise transcript alignment and junction prediction, TAGET enables accurate novel isoform detection, gene fusion detection, and expression quantification.
- At least one of HISAT2, Minimap2, or GMAP
- samtools
- Python 3
- R >= 3.3
- Linux (CentOS)
- Python packages: rpy2, pandas, numpy
- R packages: stringr, optparse, DEGseq
python TransAnnot.py -f [fasta] -g [genome fasta] -o [output directory] -a [annot gtf] -p [process] --use_minimap2 [1] --use_hisat2 [hisat2 index]
Alternatively, pass all settings in a config file:
python TransAnnot.py -c TransAnnot.Config
The running time is less than 1 hour with 8 cores on a Linux server.
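The command line above can also be assembled from a script, which helps when annotating several samples. The following is a minimal sketch that uses only the flags shown above; the helper name `build_transannot_cmd` and the example file names are hypothetical.

```python
import shlex


def build_transannot_cmd(fasta, genome_fa, out_dir, gtf, processes, hisat2_index=None):
    """Assemble the TransAnnot.py argument list from the flags documented above."""
    cmd = [
        "python", "TransAnnot.py",
        "-f", fasta,
        "-g", genome_fa,
        "-o", out_dir,
        "-a", gtf,
        "-p", str(processes),
        "--use_minimap2", "1",
    ]
    # The HISAT2 flag takes the index path, as in the usage line above.
    if hisat2_index is not None:
        cmd += ["--use_hisat2", hisat2_index]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_transannot_cmd("sample.fa", "hg38.fa", "out", "hg38.ensembl.gtf", 8, "hg38_index")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.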
1. The config file contains the path of each software tool and the index file of the reference genome.
- On first use, set the following parameters:
  - the path of HISAT2/Minimap2
  - the index file of the reference genome
  - the reference genome (FASTA), the transcript annotation file (GTF, Ensembl by default), and the number of processes
- After setting the base parameters, set the FASTA file of the full-length transcripts and the output directory, or combine -c config with -f [fasta] -o [output].
- Read, transcript, and gene expression can be calculated with the --tpm parameter.
The output contains the following files:
- {sample_id}.annot.bed: BED file with annotated genes
- {sample_id}.annot.stat: the annotation of each transcript
- {sample_id}.annot.db.pickle: the input file for visualization
- {sample_id}.annot.cluster.gene: the gene clusters
- {sample_id}.annot.cluster.transcript: the transcript clusters
- {sample_id}.annot.cluster.reads: the read clusters
- {sample_id}.annot.junction: the splice junction information
- {sample_id}.annot.multiAnno: transcripts with multiple annotations
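A quick way to confirm that a run finished is to check for the files listed above. This is a minimal sketch; it assumes all files are written directly into the output directory, and the helper names are hypothetical.

```python
from pathlib import Path

# Suffixes taken from the output file list above.
OUTPUT_SUFFIXES = [
    "annot.bed",
    "annot.stat",
    "annot.db.pickle",
    "annot.cluster.gene",
    "annot.cluster.transcript",
    "annot.cluster.reads",
    "annot.junction",
    "annot.multiAnno",
]


def expected_outputs(out_dir, sample_id):
    """Return the paths TAGET is documented to produce for one sample."""
    return [Path(out_dir) / f"{sample_id}.{suffix}" for suffix in OUTPUT_SUFFIXES]


def missing_outputs(out_dir, sample_id):
    """Return the names of documented output files that do not exist yet."""
    return [p.name for p in expected_outputs(out_dir, sample_id) if not p.exists()]


print(missing_outputs("./output", "sample1"))
```

An empty list means every documented file is present.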
ID
: read ID
Classification
: classification of the read
Subtype
: subtype of the read
Gene
: gene annotation, or a region in the genome [chr1:100000-100500]
Transcript
: transcript annotation
Chrom
: chromosome
Strand
: strand
Seq_length
: read length
Seq_exon
: number of exons in the read
Ref_length
: length of the annotated transcript
Ref_exon_num
: number of exons in the annotated transcript
diff_to_gene_start
: 5' end difference between the read and the annotated gene in the reference genome
diff_to_gene_end
: 3' end difference between the read and the annotated gene in the reference genome
diff_to_transcript_start
: 5' end difference between the read and the annotated transcript in the reference genome
diff_to_transcript_end
: 3' end difference between the read and the annotated transcript in the reference genome
exon_miss_to_transcript_start
: number of exons missing at the 5' end of the read relative to the annotated transcript
exon_miss_to_transcript_end
: number of exons missing at the 3' end of the read relative to the annotated transcript
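Because the Gene field holds either a gene annotation or a genomic region such as chr1:100000-100500, downstream scripts usually need to tell the two apart. The sketch below is one way to do that; it assumes gene names never look like `chrom:start-end`, treats the surrounding brackets as optional, and uses a hypothetical helper name.

```python
import re

# Matches "chr1:100000-100500", with or without surrounding square brackets.
REGION_RE = re.compile(r"^\[?(?P<chrom>[^:\[\]]+):(?P<start>\d+)-(?P<end>\d+)\]?$")


def parse_gene_field(value):
    """Classify a Gene field as a genomic region or a gene annotation."""
    text = value.strip()
    match = REGION_RE.match(text)
    if match is None:
        return {"kind": "gene", "name": text}
    return {
        "kind": "region",
        "chrom": match["chrom"],
        "start": int(match["start"]),
        "end": int(match["end"]),
    }


print(parse_gene_field("chr1:100000-100500"))
print(parse_gene_field("TP53"))
```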
FASTA
: [path], input file: the full-length transcripts in FASTA format
OUTPUT_DIR
: [path], the output directory
GENOME_FA
: [path], the FASTA file of the reference genome (e.g., hg38.fa)
GTF_ANNOTATION
: [path], the gene annotation file (GTF format by default)
PROCESS
: [int], the number of processes
SAMPLE_UNIQUE_NAME
: [string], the output prefix of each file
PYTHON
: [path], the path to python
TAGET_DIR
: [path], the path to TAGET
SAMTOOLS
: [path], the path to samtools
USE_HISAT2
: [int], whether to use HISAT2 (1 = use, 0 = do not use)
HISAT2
: [path], the path to HISAT2
HISAT2_INDEX
: [path], the path to the HISAT2 index, generated by hisat2-build
USE_MINIMAP2
: [int], whether to use Minimap2 (1 = use, 0 = do not use)
MINIMAP2
: [path], the path to Minimap2
USE_GMAP
: [int], whether to use GMAP (1 = use, 0 = do not use)
GMAP
: [path], the path to GMAP
GMAP_INDEX
: [path], the path to the GMAP index, generated by gmap_build
TPM_LIST
: [path], the isoform expression
READ_LENGTH
: [int], the read length used by HISAT2, default 100
READ_OVERLAP
: [int], the read overlap used by HISAT2, default 80
MIN_READ_LENGTH
: [int], the minimum read length, default 30
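Before launching a long run, it can help to sanity-check the config values. The sketch below works on a plain dictionary of the keys listed above, so it does not depend on the config file syntax. Which keys count as required is an assumption made here for illustration, and `check_config` is a hypothetical helper, not part of TAGET.

```python
# Assumption: these keys are treated as mandatory for illustration only.
REQUIRED_KEYS = [
    "FASTA", "OUTPUT_DIR", "GENOME_FA",
    "GTF_ANNOTATION", "PROCESS", "SAMPLE_UNIQUE_NAME",
]

# Each USE_* switch and the path keys that belong to that aligner.
ALIGNERS = {
    "USE_HISAT2": ["HISAT2", "HISAT2_INDEX"],
    "USE_MINIMAP2": ["MINIMAP2"],
    "USE_GMAP": ["GMAP", "GMAP_INDEX"],
}


def check_config(cfg):
    """Return a list of problems found in a dict of config values."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if not cfg.get(key)]
    enabled = [flag for flag in ALIGNERS if int(cfg.get(flag, 0)) == 1]
    if not enabled:
        # The dependency list asks for at least one aligner.
        problems.append("enable at least one of USE_HISAT2, USE_MINIMAP2, USE_GMAP")
    for flag in enabled:
        for key in ALIGNERS[flag]:
            if not cfg.get(key):
                problems.append(f"{flag}=1 but {key} is not set")
    return problems


example = {
    "FASTA": "sample.fa", "OUTPUT_DIR": "out", "GENOME_FA": "hg38.fa",
    "GTF_ANNOTATION": "hg38.ensembl.gtf", "PROCESS": 8,
    "SAMPLE_UNIQUE_NAME": "sample", "USE_MINIMAP2": 1, "MINIMAP2": "/usr/bin/minimap2",
}
print(check_config(example))
```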
TransAnnotMerge generates an expression matrix across multiple samples.
First, extract isoform expression from the FASTA file:
python fa2exp.py -f [fa] -i [prefix] -o [exp] -p [taget output directory]
-f
: the full-length transcript file in FASTA format
-o
: the output directory
-p
: prefix of the {sample_unique_name}.anno files
python script.py input.config
This step requires the exon.gtf file, which can be extracted with unzip exon.gtf.zip.
python TranAnnotMerge -c MergeConfig -o outputdir -m [TPM/FLC/None]
-c
: the merge config, consisting of four columns: sample ID, {sample_id}.annot.stat, {sample_id}.annot.bed, and {sample_id}.annot.db.pickle.
#sample | stat | bed | db |
---|---|---|---|
------- | ---- | --- | -- |
-o
: the output directory
-m
: how gene and transcript expression is reported: TPM, or FLC (full-length count); with None, no expression matrix is produced
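The four-column merge config can be generated from a list of sample IDs. The sketch below is illustrative only: the tab separator, the absence of a header line, and the helper names are assumptions, so compare the result with the MergeConfig example shipped with TAGET before using it.

```python
def merge_config_rows(sample_ids, result_dir="."):
    """Build one row per sample: ID, .annot.stat, .annot.bed, .annot.db.pickle."""
    rows = []
    for sample_id in sample_ids:
        base = f"{result_dir}/{sample_id}"
        rows.append([
            sample_id,
            f"{base}.annot.stat",
            f"{base}.annot.bed",
            f"{base}.annot.db.pickle",
        ])
    return rows


def format_merge_config(rows, sep="\t"):
    """Join the rows into file content; the separator is an assumption."""
    return "\n".join(sep.join(row) for row in rows) + "\n"


rows = merge_config_rows(["759133C", "759133N"], result_dir="./output")
print(format_merge_config(rows), end="")
```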
- Extract isoform expression from the FASTA file:
python fa2exp.py -f [fa] -o [exp]
- Run TranAnnotMerge:
python TranAnnotMerge -c MergeConfig -o outputdir -m TPM
{sample_id}.reads.exp
: the read expression of each file
{sample_id}.transcript.exp
: the transcript expression of each file
{sample_id}.gene.exp
: the gene expression of each file
gene.exp
: the gene expression matrix across samples
transcript.exp
: the transcript expression matrix across samples
merge.db.pickle
: used for viewing transcripts
python expression_V1.py -t {sample_id}.transcript.exp -g {sample_id}.gene.exp -o {prefix}
-t transcript expression of tumor and normal samples
-g gene expression of tumor and normal samples
-o prefix of the output files
-r threshold for filtering low-expression transcripts (default 0.05)
-p threshold for filtering low-expression genes (default 50)
FSM
: full splice site match
ISM
: incomplete splice site match
NIC
: novel in catalog
NNC
: novel not in catalog
GENIC
: genic
INTERGENIC
: intergenic
FUSION
: fusion
UNKNOWN
: unknown
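The classification codes above are convenient to keep in a lookup table when summarizing results. The following sketch counts reads per class from a list of codes; the input list is made-up toy data, and `summarize_classes` is a hypothetical helper.

```python
from collections import Counter

# Codes and descriptions copied from the list above.
CLASSIFICATION = {
    "FSM": "full splice site match",
    "ISM": "incomplete splice site match",
    "NIC": "novel in catalog",
    "NNC": "novel not in catalog",
    "GENIC": "genic",
    "INTERGENIC": "intergenic",
    "FUSION": "fusion",
    "UNKNOWN": "unknown",
}


def summarize_classes(codes):
    """Return (code, description, count) tuples, most frequent first."""
    counts = Counter(codes)
    return [
        (code, CLASSIFICATION.get(code, "not a documented code"), n)
        for code, n in counts.most_common()
    ]


# Toy input for illustration; real values come from the Classification field.
toy_codes = ["FSM", "FSM", "FSM", "ISM", "ISM", "NIC"]
for code, description, n in summarize_classes(toy_codes):
    print(f"{code}\t{description}\t{n}")
```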
KE
: known exon
LEKE
: left end known exon
REKE
: right end known exon
NEKSLE
: novel exon with a known splice site in the left end exon, whose unique region overlaps at least two known exons
NEKSRE
: novel exon with a known splice site in the right end exon, whose unique region overlaps at least two known exons
IE
: intron retention: two known splice sites from sequential exons of the same transcript
NEDT
: novel exon with two known splice sites from different transcripts
NELS
: novel exon with a novel left splice site
NERS
: novel exon with a novel right splice site
LEE
: left exon extension: a novel splice site at the left end of an exon that is longer than any exon overlapping it
REE
: right exon extension: a novel splice site at the right end of an exon that is longer than any exon overlapping it
NEDS
: novel exon: two novel splice sites, overlapping at least one known exon
NEIG
: novel exon inner-gene: a novel exon inside the gene, without any overlap with known exons
NEOG
: novel exon inter-gene: a novel exon outside the gene
NELE
: novel exon with a novel splice site in the far left exon
NERE
: novel exon with a novel splice site in the far right exon
MDNS
: monoexon with two novel splice sites
Rscript TAGET_fusion_2-3_ajust.r -j Jin_fusion_select.py -e STAT_select.py -c chuli.py -l ${i}.fa.minimap2.bed -s ${i}.fa.hisat2.bed -a ${i}.fa.anno.tmp.stat -t hg38.gtf -f ${i}.fa -n ${i} -o ./output
${i}.fa.long.bed: the BED file of minimap2 alignments generated by TAGET
${i}.fa.short.bed: the BED file of HISAT2 alignments generated by TAGET
${i}.fa: the CCS reads from the PacBio platform
${i}: the prefix of the generated file names
Download data.7z, demo.7z, and script.7z to run the demo; the running time is less than 1 hour with 8 cores on a Linux server. Details are given in example.readme. The reference genome can be downloaded from https://disk.pku.edu.cn:443/link/1F62976F65C4EA81C4C06A05E245049D
Here we use hg38 to annotate transcripts, so the human reference hg38.fa and an Ensembl GTF file are required. HISAT2 and minimap2 each need an index of this reference. The pickle file of the GTF file can be generated with gtf_db_make.py:
python gtf_db_make.py hg38.ensembl.gtf hg38.ensembl.v20200306.1.pickle
The reference genome can also be downloaded from https://disk.pku.edu.cn:443/link/5E0152D82C71B992690CFA9D7A3B5CF8 or https://zenodo.org/records/10091914

# Demos
### dependency
- HISAT2 v2.2.1
- Minimap2 v2.24
- samtools v1.19
- Python 3.9
- R 3.5
- Linux CentOS 7
- Python packages:
  - rpy2 v3.3.3
  - pandas v1.2.3
  - numpy v1.22.0
- R packages:
  - stringr v1.5.0
  - optparse v1.7.3
  - DEGseq v1.12
1. Fast run
python TransAnnot.py -c 759133C.Config
python TransAnnot.py -c 759133N.Config
Running time: 72 minutes
Outputs in the 759133C directory:
- 759133C.minimap2.bed
- 759133C.hisat2.bed
- 759133C.annot.bed
- 759133C.annot.stat
- 759133C.annot.db.pickle
- 759133C.annot.cluster.gene
- 759133C.annot.cluster.transcript
- 759133C.annot.cluster.reads
- 759133C.annot.junction
- 759133C.annot.multiAnno
- 759133C.anno.tmp.stat

Outputs in the 759133N directory:
- 759133N.minimap2.bed
- 759133N.hisat2.bed
- 759133N.annot.bed
- 759133N.annot.stat
- 759133N.annot.db.pickle
- 759133N.annot.cluster.gene
- 759133N.annot.cluster.transcript
- 759133N.annot.cluster.reads
- 759133N.annot.junction
- 759133N.annot.multiAnno
- 759133N.anno.tmp.stat

2. TransAnnotMerge
python fa2exp.py -f 759133C.fa -i 759133C -o 759133C -p ./expression
python fa2exp.py -f 759133N.fa -i 759133N -o 759133N -p ./expression
python script.py input.config
Running time: 35 minutes
Outputs:
- 759133.reads.exp
- 759133.transcript.exp

3. DIU analysis
python expression_V1.py -t 759133.transcript.exp -g 759133.gene.exp -o 759133
Running time: 2 minutes
Output:
- 759133_DIU.txt

4. Gene fusion
Rscript TAGET_fusion_2-3_ajust.r -j Jin_fusion_select.py -e STAT_select.py -c chuli.py -l 759133C.minimap2.bed -s 759133C.hisat2.bed -a 759133C.fa.anno.tmp.stat -t hg38.gtf -f 759133C.fa -n 759133C -o ./output
Running time: 18 minutes
Output:
- 759133C.fusion
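The four demo steps can be chained from one small driver script. This is a minimal sketch, not part of TAGET: the commands are copied from the steps above, the function name `run_demo` is hypothetical, and by default it only prints the commands (dry run) instead of executing them. It assumes it is started from the directory that holds the demo scripts and data.

```python
import shlex
import subprocess

# Commands copied from the demo steps above, in execution order.
DEMO_STEPS = [
    ("Fast run", [
        "python TransAnnot.py -c 759133C.Config",
        "python TransAnnot.py -c 759133N.Config",
    ]),
    ("TransAnnotMerge", [
        "python fa2exp.py -f 759133C.fa -i 759133C -o 759133C -p ./expression",
        "python fa2exp.py -f 759133N.fa -i 759133N -o 759133N -p ./expression",
        "python script.py input.config",
    ]),
    ("DIU analysis", [
        "python expression_V1.py -t 759133.transcript.exp -g 759133.gene.exp -o 759133",
    ]),
    ("Gene fusion", [
        "Rscript TAGET_fusion_2-3_ajust.r -j Jin_fusion_select.py -e STAT_select.py "
        "-c chuli.py -l 759133C.minimap2.bed -s 759133C.hisat2.bed "
        "-a 759133C.fa.anno.tmp.stat -t hg38.gtf -f 759133C.fa -n 759133C -o ./output",
    ]),
]


def run_demo(steps=DEMO_STEPS, dry_run=True):
    """Print each demo command and, unless dry_run is set, execute it."""
    executed = []
    for name, commands in steps:
        for command in commands:
            argv = shlex.split(command)
            print(f"[{name}] {command}")
            if not dry_run:
                # check=True stops the pipeline at the first failing command.
                subprocess.run(argv, check=True)
            executed.append(argv)
    return executed


planned = run_demo()
```

Call `run_demo(dry_run=False)` to execute the steps for real.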