% BIODIFF(1) 0.3
% Afif Elghraoui [email protected]
% August 2021
biodiff - exact comparison of biological sequences
biodiff reference.fasta query.fasta > out.vcf
biodiff is a variant caller that determines the exact differences between two biological sequences. It operates on DNA and protein sequences in FASTA format, as well as on synteny blocks in the FASTA-like UniMoG format. Sequences are paired for comparison by FASTA header (up to the first whitespace), so the headers in the reference and query must match. If the reference and query each contain only one sequence, the headers need not match: the comparison is still performed, but a warning is emitted. Output is in the Variant Call Format (VCF).
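Because sequences are paired by header, it can help to check the headers before running a comparison. The following is a minimal sketch, not part of biodiff: `fasta_ids` is a hypothetical helper that prints the part of each header that precedes the first whitespace, and the file names in the usage comments are placeholders.

```shell
# Hypothetical helper (not part of biodiff): print each sequence ID,
# i.e. the header text after '>' up to the first whitespace.
fasta_ids() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Usage (placeholder file names):
#   fasta_ids reference.fasta | sort > ref.ids
#   fasta_ids query.fasta | sort > query.ids
#   comm -3 ref.ids query.ids    # IDs present in only one of the two files
```

If `comm -3` prints nothing, both files carry the same set of sequence IDs.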
biodiff uses diff(1), an implementation of Myers' algorithm, to find the longest common subsequence and determine the minimal difference between the sequences. It is especially useful for exact genome comparison, as standard genome comparison tools are often vague about the positions of large insertions and deletions. It can be helpful to first get an accurate picture of the plain insertions and deletions that differentiate two sequences before trying to decide whether they represent translocations, tandem copy number variation, or anything else.
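The idea of applying diff(1) to sequences can be illustrated by placing each residue on its own line, so that a line-based diff becomes a residue-based one. This is only a sketch of the general technique under that assumption; it does not claim to show how biodiff prepares its input internally. `one_per_line` and `seq_diff` are hypothetical helper names.

```shell
# Hypothetical helper: put each residue of a bare sequence string on its own line.
one_per_line() {
    printf '%s\n' "$1" | fold -w 1
}

# Hypothetical helper: diff two sequence strings residue by residue.
# Like diff(1), it exits with status 1 when the sequences differ.
seq_diff() {
    ref_tmp=$(mktemp)
    qry_tmp=$(mktemp)
    one_per_line "$1" > "$ref_tmp"
    one_per_line "$2" > "$qry_tmp"
    diff "$ref_tmp" "$qry_tmp"
    status=$?
    rm -f "$ref_tmp" "$qry_tmp"
    return "$status"
}

# Example: one deleted residue and one inserted residue relative to the first string.
seq_diff ACGTACGT ACGACGGT || true
```

The line numbers that diff reports then correspond to 1-based positions in the first sequence, which is the kind of exact coordinate that a VCF record needs.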
You might want to quickly see the difference between two revisions of an NCBI reference sequence:
biodiff NC_123456.1.fasta NC_123456.2.fasta
The output goes to your terminal.
It works the same way with amino acid sequences:
biodiff wild-type.faa mutant.faa
Synteny blocks are conventionally defined in a FASTA-like format in which the sequence is a space-delimited list of block identifiers, usually numbers.
You will typically want to normalize and left-align variants, especially when comparing variants between different samples, as variants within repetitive sequences can be accurately represented in multiple ways and falsely appear to be different.
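for query in query1 query2
do
    # Call variants, then normalize and left-align them against the reference
    biodiff reference.fasta "$query.fasta" | bcftools norm --fasta-ref reference.fasta - > "$query.vcf"
    # Compress and index the result so that bcftools isec can read it
    bgzip "$query.vcf" && tabix -p vcf "$query.vcf.gz"
done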
bcftools isec query1.vcf.gz query2.vcf.gz
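The ambiguity that normalization resolves can be seen in a small sketch. In a repetitive region, the same deletion can be placed at several positions and still yield an identical sequence. The reference string and the `delete_at` helper below are made up for illustration.

```shell
# Made-up reference containing the repeat unit "CA" three times.
ref=GCACACAT

# Hypothetical helper: delete $3 characters from string $1, starting at 1-based position $2.
delete_at() {
    printf '%s\n' "$1" | awk -v p="$2" -v n="$3" \
        '{ print substr($0, 1, p - 1) substr($0, p + n) }'
}

delete_at "$ref" 2 2   # remove the first "CA"
delete_at "$ref" 4 2   # remove the second "CA"
delete_at "$ref" 6 2   # remove the third "CA"
```

All three calls print `GCACAT`. Left-alignment settles on a single, leftmost representation, so that the same event called in two samples is reported at the same position.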
biodiff currently does not properly handle in-place inversions.
Please report issues at https://gitlab.com/LPCDRP/biodiff/issues
bcftools(1), dnadiff(1), Assemblytics(1)