Hybran is a hybrid reference-based and ab initio genome annotation pipeline for prokaryotic genomes. It uses the Rapid Annotation Transfer Tool (RATT) to transfer as many annotations as possible from your reference genome annotation based on conserved synteny between the nucleotide genome sequences. Hybran then supplements unannotated regions with ab initio predictions from Prokka. Then, all coding sequence annotations are clustered together and additional reference gene names are assigned based on amino acid sequence identity and alignment coverage.
Hybran can be run on one or many genomes. The more reference annotations included, the more accurate the annotation will be and the less ambiguity will exist for the target genomes. Input can be a FASTA, a list of FASTAs (space-separated), a directory containing FASTAs, or a File Of FileNames (FOFN) of FASTAs.
hybran \
--genomes /dir/to/FASTAs | in.fasta [in2.fasta in3.fasta ...] | fastas.fofn \
--references /dir/to/reference/annotation(s) \
--output ./ \
--organism "Genus species strain" \
--nproc 2
Calling hybran without specifying a subcommand is the same as calling hybran annotate. The exception is the help menu: to see it, you must run hybran annotate --help.
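For scripted runs, the invocation above can be assembled programmatically. The following is a minimal sketch and not part of Hybran: the helper name build_hybran_command and its parameters are hypothetical, and it relies only on the options shown in the usage example. It builds the argument list without executing anything.

```python
def build_hybran_command(genomes, references, output, organism, nproc=1):
    """Assemble an argument list for `hybran annotate` (hypothetical helper).

    Only the options shown in the usage example are used. `genomes` is a list
    of FASTA paths, or a list holding a single directory or FOFN path.
    """
    cmd = ["hybran", "annotate", "--genomes"]
    cmd += [str(g) for g in genomes]
    cmd += [
        "--references", str(references),
        "--output", str(output),
        "--organism", organism,
        "--nproc", str(nproc),
    ]
    return cmd


command = build_hybran_command(
    genomes=["in.fasta", "in2.fasta"],
    references="refs",
    output="out",
    organism="Genus species strain",
    nproc=2,
)
print(" ".join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) once Hybran is installed; keeping the arguments as a list avoids shell quoting problems with the organism string.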
Final annotations are created in GenBank and GFF formats in the output directory. The output directory also contains intermediate files and informative logs and reports:
outdir/
├── hybran.log
│
├── sample1/
│ ├── annomerge/
│ │ ├── sample1.gbk
│ │ ├── sample1.gff
│ │ ├── coord_corrections.tsv
│ │ ├── prokka_unused.tsv
│ │ ├── pseudoscan_report.tsv
│ │ └── ratt_unused.tsv
│ ├── ratt/
│ │ └── ...
│ ├── ratt-postprocessed/
│ │ ├── sample1.*.final.gbk
│ │ ├── sample1.*.final.gff
│ │ ├── coord_corrections.tsv
│ │ ├── invalid_features.tsv
│ │ └── pseudoscan_report.tsv
│ ├── prokka/
│ │ └── ...
│ ├── prokka-postprocessed/
│ │ ├── sample1.gbk
│ │ ├── sample1.gff
│ │ ├── coord_corrections.tsv
│ │ ├── invalid_features.tsv
│ │ └── pseudoscan_report.tsv
├── sampleN/
│ └── ...
│
├── unified-refs/
│ ├── unifications.tsv
│ ├── unique_ref_cdss.faa
│ ├── reference1.gbk
│ ├── reference1.gff
│ ├── ...
│ ├── referenceN.gbk
│ └── referenceN.gff
├── clustering/
│ ├── multigene_clusters.txt
│ ├── novelty_report.tsv
│ ├── onlyltag_clusters.txt
│ └── singleton_clusters.txt
│
├── sample1.gbk
├── sample1.gff
├── ...
├── ...
├── sampleN.gbk
└── sampleN.gff
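To pick up the final annotations for downstream work, the top level of the output directory can be scanned. This is a minimal sketch that assumes the layout shown above, with each sample's final files sitting directly in the output directory as <sample>.gbk and <sample>.gff; the function name collect_final_annotations is hypothetical, and the demonstration runs on a throwaway directory with made-up sample names.

```python
from pathlib import Path
import tempfile


def collect_final_annotations(outdir):
    """Map each sample name to its top-level GenBank and GFF files.

    Only files directly inside `outdir` are considered, so intermediate
    copies in per-sample subdirectories and unified-refs/ are ignored.
    """
    outdir = Path(outdir)
    samples = {}
    for gbk in sorted(outdir.glob("*.gbk")):
        gff = gbk.with_suffix(".gff")
        samples[gbk.stem] = {"gbk": gbk, "gff": gff if gff.exists() else None}
    return samples


# Demonstration on a temporary directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample1.gbk", "sample1.gff", "sample2.gbk"):
        (Path(tmp) / name).touch()
    (Path(tmp) / "unified-refs").mkdir()
    (Path(tmp) / "unified-refs" / "reference1.gbk").touch()
    found = collect_final_annotations(tmp)
    # Record which formats were found for each sample.
    summary = {s: sorted(k for k, v in f.items() if v) for s, f in found.items()}

print(summary)
```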
hybran.log is the verbose run log from the pipeline. It is equivalent to what you would see on the console if you ran with --verbose.
hybran generates revised reference annotations in the unified-refs directory.
These annotations differ from the original in that each set of conserved (>=99% amino acid identity and alignment coverage) or duplicated genes is assigned a single name used for all instances.
The original name is retained as a gene_synonym qualifier in the annotation file.
The file unifications.tsv lists duplicate genes found in the reference annotations and the name they were assigned.
Columns in this file are:
- reference name
- reference locus tag
- reference gene name
- unified name
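As an illustration of how unifications.tsv might be consumed, the sketch below groups reference genes by their unified name. It assumes a tab-separated file with the four columns in the order listed above and no header row; if your file has a header, skip the first line. The field names in FIELDS and all rows in the example are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Column order as listed above; these field names are hypothetical labels.
FIELDS = ["reference", "locus_tag", "gene_name", "unified_name"]


def group_unifications(handle):
    """Group reference genes by the unified name they were assigned."""
    groups = defaultdict(list)
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        groups[row["unified_name"]].append(
            (row["reference"], row["locus_tag"], row["gene_name"])
        )
    return dict(groups)


# Made-up rows standing in for a real unifications.tsv.
example = io.StringIO(
    "ref1\tLT_0001\tabcA\tabcA\n"
    "ref1\tLT_0457\tabcB\tabcA\n"
    "ref2\tXY_0100\txyzC\txyzC\n"
)
groups = group_unifications(example)
for name, members in groups.items():
    print(name, [locus_tag for _, locus_tag, _ in members])
```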
unique_ref_cdss.faa is a multi-FASTA file of the representative amino acid sequences for each unique reference CDS.
Depending on the sequence identity and alignment coverage thresholds used, Hybran will name candidate novel genes. This novelty report allows you to examine whether these genes are truly unique based on how close they came to meeting the thresholds.
- cluster_type
- candidate_novel_gene
- nearest_ref_match: The top hit among the reference or other candidate novel genes.
- metric: The nearest_ref_match is the top hit according to the metric specified in this column. Its values for all three metrics are shown in the next columns.
- pct_aa_ident : Percent amino acid sequence identity
- pct_sub_covg : Percent subject (reference) alignment coverage
- pct_qry_covg : Percent query alignment coverage
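One way to use the novelty report is to list candidates that narrowly missed the thresholds. The sketch below is an illustration under stated assumptions: it assumes a tab-separated file with a header row that uses the column names listed above, and the parameters min_ident, min_covg, and margin are hypothetical cut-offs that you would set to match your own run. The example rows, including the cluster_type values, are made up.

```python
import csv
import io


def near_misses(handle, min_ident, min_covg, margin=5.0):
    """Return candidates whose metrics all fall within `margin` percentage
    points of the given thresholds. Rows with non-numeric metrics are skipped."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            ident = float(row["pct_aa_ident"])
            sub_covg = float(row["pct_sub_covg"])
            qry_covg = float(row["pct_qry_covg"])
        except (TypeError, ValueError):
            continue
        if (
            ident >= min_ident - margin
            and sub_covg >= min_covg - margin
            and qry_covg >= min_covg - margin
        ):
            hits.append((row["candidate_novel_gene"], row["nearest_ref_match"]))
    return hits


# Made-up report contents for illustration.
example = io.StringIO(
    "cluster_type\tcandidate_novel_gene\tnearest_ref_match\tmetric\t"
    "pct_aa_ident\tpct_sub_covg\tpct_qry_covg\n"
    "singleton\tORF0001\tgeneA\tpct_aa_ident\t93.5\t97.0\t96.2\n"
    "singleton\tORF0002\tgeneB\tpct_aa_ident\t41.0\t60.0\t55.0\n"
)
close_calls = near_misses(example, min_ident=95.0, min_covg=95.0)
print(close_calls)
```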
- locus_tag: Locus tag of the rejected feature from the source indicated by the file name or parent directory.
- gene_name: Assigned gene name of the rejected feature (lifted over from reference annotation). Same as locus tag if none was assigned.
- rival_locus_tag: Locus tag of the prevailing feature.
- rival_gene_name: Assigned gene name of the prevailing feature (lifted over from reference annotation).
- evidence_codes: Summary of the reason for rejecting the feature.
- remark: A more verbose explanation of the rejection reason.
- no_coordinates : RATT sometimes outputs malformed feature locations (see, for example, RATT#18 and RATT#19). Hybran intercepts these during parsing of the results and sets an empty location to enable continuity of the pipeline. Since the malformed feature could not be properly parsed, however, there may not be a name to refer to in the log here.
- zero_length
- categorical : Currently, rRNAs and tRNAs are only taken from the ab initio annotation, so these are categorically rejected from RATT.
- misplaced
- poor_match : When using --filter-ratt, annotations not meeting the blastp thresholds are rejected and have this evidence code applied.
fissionfuser is only applied during postprocessing of the ab initio annotations.
- complementary_fragments
- overlapping_inframe : This scenario arises as a result of postprocessing ab initio annotations. When a CDS has an internal stop, the ab initio annotation often reports what looks like a tandem duplication. Start coordinate correction by pseudoscan often extends the downstream fragment to overlap with the upstream fragment, and fissionfuser identifies this fission event signature.
- redundant_fusion_member
- combined_annotation
- putative_misannotation
- identical
- identical_non_cds
- shorter
- shorter_pseudogene
- forfeit : When postprocessed RATT and ab initio annotations are equally valid, RATT is preferred since its name assignment derives from synteny.
- internal_stop
- worse_ref_correspondence
- pseudo
- unnamed : When an ab initio annotation for which a name could not be assigned using blastp hits conflicts with a RATT annotation, the ab initio annotation is rejected for this reason.
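To get an overview of why features were rejected, the evidence codes in a ratt_unused.tsv or prokka_unused.tsv file can be tallied. This is a minimal sketch that assumes a tab-separated file with a header row using the column names listed above. It counts each evidence_codes value as a whole string; if a row can carry several codes, how they are delimited is not covered here and would need to be checked against your files. The example rows are made up.

```python
import csv
import io
from collections import Counter


def tally_evidence_codes(handle):
    """Count how often each evidence_codes value occurs."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["evidence_codes"]] += 1
    return counts


# Made-up rejected-feature rows for illustration.
example = io.StringIO(
    "locus_tag\tgene_name\trival_locus_tag\trival_gene_name\t"
    "evidence_codes\tremark\n"
    "L_00010\tgeneA\tR_00010\tgeneA\tforfeit\tEqually valid call\n"
    "L_00020\tL_00020\tR_00021\tgeneB\tunnamed\tNo name assigned\n"
    "L_00030\tgeneC\tR_00030\tgeneC\tforfeit\tEqually valid call\n"
)
counts = tally_evidence_codes(example)
for code, n in counts.most_common():
    print(code, n)
```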
- locus_tag: Locus tag of the annotated feature from the source indicated by the parent directory.
- gene_name: Assigned gene name (lifted over from reference annotation)
- strand
- og_start: Original start position
- og_end: Original end position
- new_start: Updated start position
- new_end: Updated end position
- fixed_start_codon: Whether the start codon was corrected ('true' or 'false')
- fixed_stop_codon: Whether the stop codon was corrected ('true' or 'false')
- gene_length_diff: The percent difference in gene length between the original and updated locations
- status: Whether the correction was accepted or rejected
For og_start, og_end, new_start, and new_end, "start" always corresponds to the low number on the genome and "end" corresponds to the high number, regardless of strand. new_start and new_end are not necessarily modified from the original coordinates. fixed_start_codon and fixed_stop_codon indicate whether they have changed, but these correspond to the strand-adjusted start and stop positions, hence the reference to codons.
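The relationship between the low/high coordinate columns and the strand-adjusted codon positions can be sketched as follows. This is an illustration only, not Hybran's implementation: the function names are hypothetical, and the strand is assumed to be encoded as "+" or "-", which may not match the values in the strand column of your files.

```python
def codon_coordinates(strand, start, end):
    """Return (start_codon_coord, stop_codon_coord) on the genome.

    `start` and `end` are the low and high coordinates, as in the table
    columns. On the reverse strand the start codon sits at the high end.
    """
    if strand == "+":
        return start, end
    if strand == "-":
        return end, start
    raise ValueError(f"unexpected strand: {strand!r}")


def changed_codons(strand, og_start, og_end, new_start, new_end):
    """Report which strand-adjusted end of the feature moved."""
    og_codon_start, og_codon_stop = codon_coordinates(strand, og_start, og_end)
    new_codon_start, new_codon_stop = codon_coordinates(strand, new_start, new_end)
    return {
        "start_codon_moved": og_codon_start != new_codon_start,
        "stop_codon_moved": og_codon_stop != new_codon_stop,
    }


# Reverse-strand example with made-up coordinates: only the high
# coordinate changes, which corresponds to the start codon.
result = changed_codons("-", 1000, 1900, 1000, 1960)
print(result)
```

In this made-up case new_end differs from og_end, yet it is the start codon that moved, because the feature lies on the reverse strand.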
pseudoscan_report.tsv is a summary of the characteristics of "interesting" features found by pseudoscan. Such features include all genes to which the pseudo tag was applied, as well as non-pseudo genes that had signatures consistent with pseudo status but also had a redeeming attribute.
{!pseudoscan.report-format.md!}
Elghraoui, A.; Gunasekaran, D.; Ramirez-Busby, S. M.; Bishop, E.; Valafar, F. Hybran: Hybrid Reference Transfer and Ab Initio Prokaryotic Genome Annotation. bioRxiv November 10, 2022, p 2022.11.09.515824. doi:10.1101/2022.11.09.515824.