De novo Viral Genome Annotator
VIGA is a script written in Python 3 that annotates viral genomes automatically (using a de novo algorithm) and predict the function of their proteins using BLAST and HMMER. This script works in UNIX-based OS, including MacOSX and the Windows Subsystem for Linux.
Before using this script, the following Python modules and programs should be installed:
-
Python modules:
- BCBio.GFF (https://github.com/chapmanb/bcbb/tree/master/gff)
- Biopython (Bio module; Cock et al. 2009)
- Numpy (https://github.com/numpy/numpy)
- Scipy (https://github.com/scipy/scipy)
- pyHMMer (Larralde and Zeller, 2023; https://github.com/althonos/pyhmmer): especially from 0.10.15 towards
-
Programs:
- LASTZ (Harris 2007): it is used to predict the circularity of the contigs. The program is publicly available at https://github.com/lastz/lastz under the MIT licence.
- INFERNAL (Nawrocki and Eddy 2013): it is used to predict ribosomal RNA in the contigs when using the RFAM database (Nawrocki et al. 2015). This program is publicly available at https://eddylab.org/infernal/ under the BSD licence and RFAM database is available at ftp:https://ftp.ebi.ac.uk/pub/databases/Rfam/
- ARAGORN (Laslett and Canback 2004): it is used to predict tRNA sequences in the contig. This program is publicly available at https://mbio-serv2.mbioekol.lu.se/ARAGORN/ under the GPLv2 licence.
- PILERCR (Edgar 2007): it is used to predict CRISPR repeats in your contig. This program is freely available at https://drive5.com/pilercr/ under a public licence.
- Prodigal (Hyatt et al. 2010): it is used to predict the ORFs. When the contig is smaller than 100,000 bp, MetaProdigal (Hyatt et al. 2012) is automatically activated instead of normal Prodigal. This program is publicly available at https://github.com/hyattpd/prodigal/releases/ under the GPLv3 licence.
- Prodigal-GV (Camargo et al. 2023): this variant of Prodigal contains updated models for ORF detection in giant viruses and bacteriophages with alternate stop codons ("genetic code 15"). This approach is recommended for the annotation of bacteriophages, especially crAss-like phages (Cook et al. 2023). This program is available at https://github.com/apcamargo/prodigal-gv under the GPLv3 licence.
- DIAMOND (Buchfink et al. 2015): it is used to predict the function of proteins according to homology. This program is publicly available at https://github.com/bbuchfink/diamond under the GPLv3 licence. Databases must be created from FASTA files according to their instructions before running.
- BLAST+ (Camacho et al. 2008): it is used to predict the function of the predicted proteins according to homology when DIAMOND is not able to retrieve any hit or such hit is a 'hypothetical protein'. This suite is publicly available at ftp:https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ under the GPLv2 licence. Databases are available at ftp:https://ftp.ncbi.nlm.nih.gov/blast/db/ or created using makeblastdb command.
- HMMER (Finn et al. 2011): it is used to add more information of the predicted proteins according to Hidden Markov Models. This suite is publicly available at https://hmmer.org/ under the GPLv3 licence. Databases must be in HMM format and an example of potential database is PVOGs (https://dmk-brain.ecn.uiowa.edu/VOG/downloads/All/AllvogHMMprofiles.tar.gz).
Although you can install these programs manually, we strongly recommend the use of the installation bash script 'install.sh' to facilitate the installation of all these dependencies. After running the installation script, and before running VIGA, you must run the 'create_db.sh' bash script to install all requested databases. In this case, the databases that the program uses are RFAM (Nawrocki et al. 2015), RefSeq Viral Proteins (Brister et al. 2015), Prokaryotic Viral Orthologous Genes (PVOGs; Grazziotin et al. 2017), Reference Viral DataBase (RVDB; Bigot et al. 2020) and VOGs (Marz et al. 2014) which will be formatted automatically to be used in BLAST, DIAMOND and HMMER.
When using this program, you must cite their use:
González-Tortuero E, Sutton TDS, Velayudhan V, Shkoporov AN, Draper LA, Stockdale SR, Ross RP, Hill C (2018) VIGA: a sensitive, precise and automatic de novo VIral Genome Annotator. bioRxiv 277509; doi: https://doi.org/10.1101/277509
The program has the following two types of arguments:
--input FASTAFILE | Input file as a nucleotidic FASTA file. It can contains multiple sequences (e.g. metagenomic contigs) |
--modifiers TEXTFILE | Input file as a plain text file with the modifiers per every FASTA header according to SeqIn (https://www.ncbi.nlm.nih.gov/Sequin/modifiers.html). All modifiers must be written in a single line and are separated by a single space character. No space should be placed besides the = sign. For example: [organism=Serratia marcescens subsp. marcescens] [sub-species=marcescens] [strain=AH0650_Sm1] [moltype=DNA] [tech=wgs] [gcode=11] [country=Australia] [isolation-source=sputum]. This line will be copied and printed along with the record name as the definition line of every contig sequence. |
--out OUTPUTNAME | Name of the outputs files without extensions, as the program will add them automatically. By default, the program will use the input name as the output. |
--locus STRING | Name of the contigs/sequences. If the input is a multiFASTA file, please put a general name as the program will add the number of the contig at the end of the name. By default, the name of the contigs will be "LOC". |
--threads INT | Number of threads/CPUs. By default, the program will use all available CPUs. |
--mincontigsize INT | Minimum contig length to be considered in the final files. By default, the program only consider from 200 bp. |
--blast | Using BLAST to predict protein function based on homology. By default, DIAMOND (fast approach) will be used instead of BLAST, but it is highly recommendable to use BLAST for better results |
--readlength INT | Read length for the circularity prediction (default: 101 bp) |
--windowsize INT | Window length used to determine the origin of replication in circular contigs according to the cumulative GC skew(default: 100 bp) |
--slidingsize INT | Sliding window length used to determine the origin of replication in circular contigs according to the cumulative GC skew(default: 10 bp) |
--minrepeat INT | Minimum repeat length for CRISPR detection (Default: 16) |
--maxrepeat INT | Maximum repeat length for CRISPR detection (Default: 64) |
--minspacer INT | Minimum spacer length for CRISPR detection (Default: 8) |
--maxspacer INT | Maximum spacer length for CRISPR detection (Default: 64) |
--prodigal-gv | Use Prodigal-GV for viral ORF prediction instead of Prodigal. In this case, there is no need to use the --gcode option, as this program includes all special genetic codes automatically. Recommended for some phages like crAss-like phages and in cases where there is a low gene density and no ncRNA elements where found (Default: False) | ||||||||||||||||||||||||||||||||||||||||||||||||||
--gcode NUMBER | Number of GenBank translation table. At this moment, the available options are:
|
--typedata BCT|CON|VRL|PHG | GenBank Division: One of the following codes:
|
--norfam | Don't run RFAM to predict other ncRNAs, apart of rRNAs and tRNAs. (Default: False) |
--rfamdb RFAMDB | RFAM Database that will be used for the ncRNA prediction. RFAMDB should be in the format "/full/path/to/rfamdb/Rfam.cm" and must be compressed accordingly (see INFERNAL manual) before running the script. By default, the program will try to search Rfam inside the folder database/ (after running the Create_databases.sh script) |
--diamonddb DIAMONDDB | DIAMOND Database that will be used for the protein function prediction. The database must be created from a amino acid FASTA file as indicated in https://github.com/bbuchfink/diamond. By default, the program will try to search the RefSeq Viral Protein DB formatted for its use in Diamond inside the folder database/ only if --blast parameter is disabled and after running the Create_databases.sh script |
--diamondevalue FLOAT | DIAMOND e-value threshold. By default, the threshold will be 1e-05. |
--diamondidthr FLOAT | DIAMOND identity threshold to consider that a protein belong to a specific hit. By default, the threshold is 50.0 % |
--diamondcoverthr FLOAT | DIAMOND coverage threshold to consider that a protein belong to a specific hit. By default, the threshold is 50.0 % |
--blastdb BLASTDB | BLAST Database that will be used to refine the protein function prediction in hypothetical proteins. The database must be an amino acid one, not nucleotidic. By default, the program will try to search the RefSeq Viral Protein DB formatted for its use in BLAST inside the folder database/ only if --blast parameter is ensabled and after running the Create_databases.sh script |
--blastexh | Use of exhaustive BLAST to predict the proteins by homology according to Fozo et al. (2010). In this case, the search will be done using a word size of 2, a gap open penalty of 8, a gap extension penalty of 2, the PAM70 matrix instead of the BLOSUM62 and no compositional based statistics. This method is more accurate to predict the functions of the proteins but it is slower than BLAST default parameters. By default, exhaustive BLAST is disabled. |
--blastevalue FLOAT | BLAST e-value threshold. By default, the threshold will be 1e-05. |
--blastidthr FLOAT | BLAST identity threshold to consider that a protein belong to a specific hit. By default, the threshold is 50.0 % |
--blastcoverthr FLOAT | BLAST coverage threshold to consider that a protein belong to a specific hit. By default, the threshold is 50.0 % |
Advanced parameters for protein function annotation based on Hidden Markov Models using HMMer:
--nohmmer | Running only BLAST to predict protein function. In this case, the program will run fast. By default, this option is DISABLED. |
--hmmerdb HMMDB | HMMer Database that will be used to add additional information for all proteins according to Hidden Markov Models. This database must be in HMM format and it is mandatory if --nohmmer is disabled." |
--hmmerevalue FLOAT | HMMer e-value threshold. By default, the threshold is 1e-03. |
--hmmeridthr FLOAT | HMMer identity threshold. By default, the threshold is 50.0 % |
--hmmercoverthr FLOAT | HMMer coverage threshold. By default, the threshold is 50.0 % |
An example of execution (using only basic parameters) is:
python VIGA.py --input contigs.fasta --diamonddb databases/refseq_viral_proteins --blastdb databases/refseq_viral_proteins --hmmerdb databases/PVOGS.hmm --rfamdb databases/Rfam.cm --modifiers modifiers.txt
Another example (this time disabling the use of HMMer, which allows a fast execution of the pipeline) is:
python VIGA.py --input contigs.fasta --diamonddb databases/refseq_viral_proteins --blastdb databases/refseq_viral_proteins --nohmmer --rfamdb databases/rfam/Rfam.cm --modifiers ../modifiers.txt
Finally, in case that you need to run the pipeline in a server, computer cluster or supercomputer, using the --threads parameter is highly recommended:
python VIGA.py --input contigs.fasta --diamonddb databases/refseq_viral_proteins --blastdb databases/refseq_viral_proteins --nohmmer --rfamdb databases/rfam/Rfam.cm --modifiers ../modifiers.txt --threads 10
In this branch, there will be program versions with proposed changes. After being tested multiple times, these changes might (or not) be considered to update the main source code in the master branch.
- v.0.12.0 - Replaced HMMER by PyHMMER for the search using HMM models as found to be a much faster implementation than the original program.
- v.0.11.1 - Added the option of running Prodigal-GV instead of Prodigal as well as fixed some issues related with the generation of output files as a consequence of several changes in the Features in Biopython.
- v.0.11.0 - Uploaded developer version of the source code. In this case, the idea is to automate the process of installation, creation of databases and minimising the number of required parameters to be used. In future versions, it will be also implemented the automation of the PVOGs and RVDB output interpretation (based on Moura de Sousa et al. 2021)
- Bigot T, Temmam S, Pérot P, Eloit M (2020) RVDB-prot, a reference viral protein database and its HMM profiles. F1000Research 8: 530.
- Brister JR, Ako-Adjei D, Bao Y, Blinkova O (2015) NCBI viral genomes resource. Nucleic Acids Research 43: D571–7.
- Brown CT, Olm MR, Thomas BC, Banfield JF (2016) Measurement of bacterial replication rates in microbial communities. Nature Biotechnology 34: 1256-63.
- Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nature Methods 12: 59-60.
- Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2008) BLAST+: architecture and applications. BMC Bioinformatics 10: 421.
- Camargo AP, Roux S, Schulz F, Babinski M, Xu Y, Hu B, Chain PSG, Nayfach S, Kyrpides NC (2023) Identification of mobile genetic elements with geNomad. Nature Biotechnology: 1-10
- Cook R, Telatin A, Bouras G, Camargo AP, Larralde M, Edwards RA, Adriaenssens EM (2023) Predicting stop codon reassignment improves functional annotation of bacteriophages. bioRxiv 19: 2023.12.19.572299.
- Edgar RC (2007) PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics 8:18.
- Finn RD, Clements J, Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Research 39: W29-37.
- Fozo EM, Makarova KS, Shabalina SA, Yutin N, Koonin EV, Storz G (2010) Abundance of type I toxin-antitoxin systems in bacteria: searches for new candidates and discovery of novel families. Nucleic Acids Research 38: 3743-59.
- Grazziotin AL, Koonin EV, Kristensen DM (2017) Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Research 45: D491–8.
- Harris RS (2007) Improved pairwise alignment of genomic DNA. Ph.D. Thesis, The Pennsylvania State University.
- Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11: 119.
- Hyatt D, Locascio PF, Hauser LJ, Uberbacher EC (2012) Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28: 2223-30.
- Larralde M, Zeller G (2023) PyHMMER: a Python library binding to HMMER for efficient sequence analysis. Bioinformatics 39: btad214.
- Laslett D, Canback B (2004) ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Research 32: 11–16.
- Marz M, Beerenwinkel N, Drosten C, Fricke M, Frishman D, Hofacker IL, Hoffmann D, Middendorf M, Rattei T, Stadler PF, Töpfer A (2014) Challenges in RNA virus bioinformatics. Bioinformatics 30: 1793-9.
- Moura de Sousa JA, Pfeifer E, Touchon M, Rocha EPC (2021) Causes and consequences of bacteriophage diversification via genetic exchanges across lifestyles and bacterial taxa. Molecular Biology and Evolution, msab044: https://doi.org/10.1093/molbev/msab044
- Nawrocki EP, Eddy SR (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29: 2933-35.
- Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, Floden EW, Gardner PP, Jones TA, Tate J, Finn RD (2013) Rfam 12.0: updates to the RNA families database. Nucleic Acids Research 43: D130-7.
- Saripella GV, Sonnhammer EL, Forslund K (2016) Benchmarking the next generation of homology inference tools. Bioinformatics 32: 2636-41.
- Seemann T (2014) Prokka: rapid prokaryote genome annotation. Bioinformatics 30: 2068-9.
- Shah N, Nute MG, Warnow T, Pop M (in press) Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics doi: 10.1093/bioinformatics/xxxxx