TSEBRA IS DEVELOPED AND MAINTAINED AT https://github.com/Gaius-Augustus/TSEBRA.
TSEBRA is a combiner tool that selects transcripts from gene predictions based on the support by extrisic evidence in form of introns and start/stop codons. It was developed to combine BRAKER11 and BRAKER22 predicitons to increase their accuracies.
Python 3.5.2 or higher is required.
Download TSEBRA:
git clone https://github.com/LarsGab/TSEBRA
Or download TSEBRA as submodule of BRAKER with:
git clone --recurse-submodules https://github.com/Gaius-Augustus/BRAKER
The main script is ./bin/tsebra.py
. For usage information run ./bin/tsebra.py --help
.
TSEBRA needs a list of gene prediciton files, a list of hintfiles and a configuration file as input.
The gene prediction files needs to be in gtf format. This is the standard output format of a BRAKER or AUGUSTUS3,4 gene prediciton.
Example:
2L AUGUSTUS gene 83268 87026 0.88 - . g5332
2L AUGUSTUS transcript 83268 87026 0.88 - . g5332.t1
2L AUGUSTUS intron 84278 87019 1 - . transcript_id "file_1_file_1_g5332.t1"; gene_id "file_1_file_1_g5332";
2L AUGUSTUS CDS 87020 87026 0.88 - 0 transcript_id "file_1_file_1_g5332.t1"; gene_id "file_1_file_1_g5332";
2L AUGUSTUS exon 87020 87026 . - . transcript_id "file_1_file_1_g5332.t1"; gene_id "file_1_file_1_g5332";
The hint files have to be in gff format, the last column must include an attribute for the source for the hint with 'src=' and can include the number of hints supporting the gene structure segment with 'mult='. This is the standard file format of the hintfiles.gff
in a BRAKER working directory.
Example:
2L ProtHint intron 279806 279869 2 + . src=P;mult=25;pri=4;al_score=0.437399;
2L ProtHint intron 275252 275318 2 - . src=P;mult=19;pri=4;al_score=0.430006;
2L ProtHint stop 293000 293002 1 + 0 grp=7220_0:002b08_g42;src=C;pri=4;
2L ProtHint intron 207632 207710 1 + . grp=7220_0:002afa_g26;src=C;pri=4;
2L ProtHint start 207512 207514 1 + 0 grp=7220_0:002afa_g26;src=C;pri=4;
The configuration file has to include three types of parameter:
- The weight for each hint source. A weight is set to 1, if the weight for a source is not determined in the cfg file.
- Required fraction of supported introns or supported start/stop-codons for a transcript.
- Allowed difference between two overlapping transcripts for each feature type.
Example:
# src weights
P 0.1
E 10
C 5
M 1
# Low evidence support
intron_support 0.75
stasto_support 1
# Feature differences
e_1 0
e_2 0.5
e_3 25
e_4 10
The recommended and most common usage for TSEBRA is to combine the resulting braker.gtf
files of a BRAKER1 and a BRAKER2 run using the hintsfile.gff from both working directories. However, TSEBRA can be applied to any number (>1) of gene predictions and hint files as long as they are in the correct format.
A common case might be that a user wants to annotate a novel genome with BRAKER and has:
- a novel genome with repeats masked:
genome.fasta.masked
, - hints for intron positions from RNA-seq reads
rna_seq_hints.gff
, - database of homologous proteins:
proteins.fa
.
- Run BRAKER1 and BRAKER2 for example with
### BRAKER1
braker.pl --genome=genome.fasta.masked --hints=rna_seq_hints.gff \
--softmasking --species=species_name --workingdir=braker1_out
### BRAKER2
braker.pl --genome=genome.fasta.masked --prot_seq=proteins.fa \
--softmasking --species=species_name --epmode --prg=ph \
--workingdir=braker2_out
- Make sure that the gene and transcript IDs of the gene prediction files are in order (this step is optional)
./bin/fix_gtf_ids.py --gtf braker1_out/braker.gtf --out braker1_fixed.gtf
./bin/fix_gtf_ids.py --gtf braker2_out/braker.gtf --out braker2_fixed.gtf
- Combine predicitons with TSEBRA
./bin/tsebra.py -g braker1_fixed.gtf,braker2_fixed.gtf -c default.cfg \
-e braker1_out/hintsfile.gff,braker2_out/hintsfile.gff \
-o braker1+2_combined.gtf
The combined gene prediciton is braker1+2_combined.gtf
.
A small example is located at example/
. Run ./example/run_prevco_example.sh
to execute the example and to check if TSEBRA runs properly.
All source code, i.e. bin/*.py
are under the Artistic License (see https://opensource.org/licenses/Artistic-2.0).
[1] Hoff, Katharina J, Simone Lange, Alexandre Lomsadze, Mark Borodovsky, and Mario Stanke. 2015. “BRAKER1: Unsupervised Rna-Seq-Based Genome Annotation with Genemark-et and Augustus.” Bioinformatics 32 (5). Oxford University Press: 767--69.↑
[2] Tomas Bruna, Katharina J. Hoff, Alexandre Lomsadze, Mario Stanke and Mark Borodvsky. 2021. “BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database." NAR Genomics and Bioinformatics 3(1):lqaa108.↑***
[3] Stanke, Mario, Mark Diekhans, Robert Baertsch, and David Haussler. 2008. “Using Native and Syntenically Mapped cDNA Alignments to Improve de Novo Gene Finding.” Bioinformatics 24 (5). Oxford University Press: 637--44.↑
[4] Stanke, Mario, Oliver Schöffmann, Burkhard Morgenstern, and Stephan Waack. 2006. “Gene Prediction in Eukaryotes with a Generalized Hidden Markov Model That Uses Hints from External Sources.” BMC Bioinformatics 7 (1). BioMed Central: 62.↑