The following instructions are a prelimary protocol for producing a BRAKER123 gene set that integrates extrinsic evidence from long-read RNA-Seq (PacBio ccs), short-read RNA-seq (Illumina) and a large database of protein sequences (e.g. OrthoDB 9 clade) into a single prediction.
⚠️ This protocol was developed for presentation of intermediate results at The Plant and Animal Genome Conference XXIX (2022) in San Diego. Significant changes are expected prior publication in a peer reviewed scientific journal. For preliminary accuracy results, see https://github.com/Gaius-Augustus/BRAKER/blob/master/docs/slides/slides_PAG2022.pdf (will be made available after January 9th 2022).
You have to install the 'long_reads' branch of BRAKER via GitHub https://github.com/Gaius-Augustus/BRAKER.
⚠️ The long read protocol does not work with e.g. an Anaconda installation of BRAKER because the long read branch is not part of the BRAKER release, yet.
Clone BRAKER:
git clone https://github.com/Gaius-Augustus/BRAKER
Switch to the 'long_reads' branch:
cd BRAKER/
git checkout long_reads
Follow the installation instructions of the README.md in the BRAKER repository.
Importantly, export BRAKER/scripts
to your $PATH
variable, e.g:
export PATH="/your/path/to/BRAKER/scripts:$PATH"
Make sure that braker.pl
and gmst2globalCoords.py
are executable (e.g. run which braker.pl
and which gmst2globalCoords.py
which should return the location of the respective scripts).
You have to install the 'long_reads' branch of TSEBRA4 via GitHub https://github.com/Gaius-Augustus/TSEBRA.
⚠️ The long read protocol does not work with e.g. an Anaconda installation of TSEBRA because the long read branch is not part of the TSEBRA release, yet.
Clone TSEBRA:
git clone https://github.com/Gaius-Augustus/TSEBRA
Switch to 'long_reads' branch:
cd TSEBRA/
git checkout long_reads
Export TSEBRA/bin
to your $PATH
variable, e.g:
export PATH="/your/path/to/TSEBRA/bin:$PATH"
Make sure that tsebra.py
is executable (e.g. by running which tsebra.py
, this should return the location of the tsebra.py executable).
Download GeneMarkS-T5 from https://exon.gatech.edu/GeneMark/license_download.cgi (the GeneMarkS-T) option. Unpack and install GeneMarkS-T as described in GeneMarkS-T’s README
file.
Export gmst.pl
to your $PATH
variable, e.g:
export PATH="/your/path/to/gmst/gmst.pl:$PATH"
Download Minimap26 from GitHub at https://github.com/lh3/minimap2. Follow the instructions in
the README file to install it and make sure that minimap2
is executable.
You have to install Cupcake as it is described here.
Make sure that collapse_isoforms_by_sam.py
is executable.
AUGUSTUS78 should already be downloaded for the BRAKER installation. Make sure that the scripts at AUGUSTUS/scripts
are executable (i.e. stringtie2fa.py
) and available in your $PATH
variable, e.g:
export PATH="/your/path/to/AUGUSTUS/scripts:$PATH"
Below is a complete list of all necessary input files and their names used in this manual.
genome.fa
- genome sequence in FASTA format that has been softmasked for repeatsRNAseq.bam
- spliced alignments of short-read RNA-Seq in BAM format (RNA-Seq hints can be used instead of the BAM file, see the documentation of BRAKER for usage information)proteins.fa
- a large database of protein sequences in FASTA format (e.g. a suitable OrthoDB partition)subreads1.fastq,subreads2.fastq,subreads3.fastq,...
- list of assembled subread libraries from long-read RNA-Seq as FASTQ files (protocol has only been tested with PacBio ccs reads)
The workflow of this protocol is divided into four parts. In the first three steps, you have to generate three different gene sets, each using one source of extrinsic evidence. In the final step, these gene sets are combined using TSEBRA. The first three steps (BRAKER1, BRAKER2, GeneMarkS-T protocol) can be executed in any order.
Before you start, create a working directory for your project and set the number of threads that you want to use on your processor:
threads=8
wdir='/your/path/to/the/working/directory/'
mkdir $wdir
Make sure that your $GENEMARK_PATH
(pointing to GeneMark-ES/ET/EP, not to GeneMarkS-T) and $AUGUSTUS_CONFIG_PATH
paths are set correctly
or pass them manually to BRAKER by adding them in the command line below. See the BRAKER documentation for more information.
For convenience we will store all results in the same location, referred to as $wdir. You need to adapt this bash variable:
wdir=/a/directory/of/your/choice/
Run BRAKER using the spliced alignments of short-read RNA-seq to execute the BRAKER1 protocol:
braker.pl --genome=genome.fa --softmasking --cpu=$threads --bam=RNAseq.bam --workingdir=$wdir/braker1/ 2> $wdir/braker1.log
The important results from this run are the BRAKER1 gene set augustus.hints.gtf
and the hints from short-reads hintsfile.gff
. Make sure that they are available at $wdir/braker1/
.
Make sure that your $GENEMARK_PATH
, PROTHINT_PATH
, DIAMOND_PATH
and $AUGUSTUS_CONFIG_PATH
paths are set correctly
or pass them manually to BRAKER by adding them in the command line below. See the BRAKER documentation for more information.
Run BRAKER using the proteins to execute the BRAKER2 protocol:
braker.pl --genome=genome.fa --softmasking --cpu=$threads --epmode --prot_seq=proteins.fa --workingdir=$wdir/braker2/ 2> $wdir/braker2.log
The important results from this run are the BRAKER2 gene set augustus.hints.gtf
and the hints generated from the protein database hintsfile.gff
. Make sure that they are available at $wdir/braker2/
.
In this protocol, you map the transcripts from the assembled subread libraries from long-read RNA-Seq to the genome sequence and use GeneMarkS-T to identify protein-coding regions in the transcripts.
Create a working directory and change into it:
mkdir $wdir/long_read_protocol
cd $wdir/long_read_protocol
Concatenate the subread libraries files into a single file (note that you have to replace subreads1.fastq,subreads2.fastq,subreads3.fastq
with your list of files in following command, it can be arbitrariely many files):
cat subreads1.fastq,subreads2.fastq,subreads3.fastq > all.subreads.fastq
Use Minimap2 to map the transcripts to the genome sequence, and then sort the result:
minimap2 -t $threads -ax splice:hq genome.fa all.subreads.fastq > long_reads.sam
sort -k 3,3 -k 4,4n $wlong_reads.sam > long_reads.s.sam
Collapse redundant isoforms using a script from Cupcake:
collapse_isoforms_by_sam.py --input all.subreads.fastq --fq -s long_reads.s.sam --dun-merge-5-shorter -o cupcake
You will find the collapsed transcripts in cupcake.collapsed.gff
.
Run GeneMarkS-T to predict protein-coding regions in the transcripts:
stringtie2fa.py -g genome.fa -f cupcake.collapsed.gff -o cupcake.fa
gmst.pl --strand direct cupcake.fa.mrna --output gmst.out --format GFF
Use the GeneMarkS-T coordinates and the long-read transcripts to create a gene set in GTF format.
gmst2globalCoords.py -t cupcake.collapsed.gff -p gmst.out -o gmst.global.gtf -g genome.fa
The result of this step is located at $wdir/long_read_protocol/gmst.global.gtf
.
In the final step, the long-read version of TSEBRA is used to combine the three gene sets using all extrinsic evidence.
Create a working directory:
mkdir $wdir/tsebra
cd $wdir/tsebra
Run TSEBRA (note that you have to change the TSEBRA path to your TSEBRA installation directory for the -c
option):
tsebra.py -g $wdir/braker1/augustus.hints.gtf,$wdir/braker2/augustus.hints.gtf -e $wdir/braker1/hintsfile.gff,$wdir/braker2/hintsfile.gff -l $wdir/long_read_protocol/gmst.global.gtf -c /your/path/to/TSEBRA/config/long_reads.cfg -o tsebra.gtf
The final gene set is located at $wdir/tsebra/tsebra.gtf
.
[1] K. J. Hoff, S. Lange, A. Lomsadze, M. Borodovsky, and M. Stanke. 2015. “BRAKER1: Unsupervised Rna-Seq-Based Genome Annotation with GeneMark-Et and AUGUSTUS.” Bioinformatics 32 (5). Oxford University Press: 767--69.↑
[2] T. Bruna, K. J. Hoff, A. Lomsadze, M. Stanke and M. Borodvsky. 2021. “BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database." NAR Genomics and Bioinformatics 3(1):lqaa108.↑
[3] K. J. Hoff, A. Lomsadze, M. Borodovsky, and M. Stanke. (2019). Whole-Genome Annotation with BRAKER. Methods Mol Biol. 1962:65-95, doi: 10.1007/978-1-4939-9173-0_5. ↑
[4] L. Gabriel, K. J. Hoff, T. Bruna, M. Borodovsky and M. Stanke. 2021. "TSEBRA: transcript selector for BRAKER." Bioinformatics 22: 566.↑
[5] S. Tang, A. Lomsadze and M. Borodovsky. 2015. "Identification of protein coding regions in RNA transcripts." Nucleic acids research 43 (12): e78-e78.↑
[6]H. Li. 2018. "Minimap2: pairwise alignment for nucleotide sequences". Bioinformatics Volume 34 (18): 3094-3100.↑
[7] M. Stanke, M. Diekhans, R. Baertsch, and D. Haussler. 2008. “Using Native and Syntenically Mapped cDNA Alignments to Improve de Novo Gene Finding.” Bioinformatics 24 (5). Oxford University Press: 637--44.↑
[8] M. Stanke, O. Schöffmann, B. Morgenstern, and S. Waack. 2006. “Gene Prediction in Eukaryotes with a Generalized Hidden Markov Model That Uses Hints from External Sources.” BMC Bioinformatics 7 (1). BioMed Central: 62.↑
[9] E. V. Kriventseva, D. Kuznetsov, F. Tegenfeldt, M. Manni, R. Dias, F. A. Simão and E. M. Zdobnov. 2019. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic acids research, 47(D1), D807-D811.↑