notes on the evaluation of vg in preprint #2

ekg commented May 28, 2019

Thank you for a beautiful piece of work that is certain to have a long-lasting impact on this nascent field! I look forward to examining it carefully and expect to find many applications for your method in future work.

I should apologize for making some annoying comments on Twitter about there being bugs in vg's long-read alignment model. I thought it'd be better to clarify them with some analyses based on your datasets, and Twitter doesn't have space for that. I considered making a gist, but perhaps it's fine to discuss it here. Sorry for the length.

There's nothing to fix and nothing for you to do. This post is just to satisfy my need to explain what you found when evaluating vg. @maickrau brought it to my attention, and I was able to work out a resolution.

First, some documentation about vg map that's missing from the docs because so far it only exists in my thesis. (Having written this, I think I should post it on the vg wiki...)

vg map for long reads

vg map uses a somewhat unusual heuristic alignment model for long reads. It works in stages (a sketch of the first step follows the list):

1. The read is chunked up into overlapping pieces of a fixed length. The default is 256bp, which is meant mainly to keep the long-read mode from being invoked on short reads; I've found 128bp to work better.
2. Each chunk is multimapped using vg's MEM-based alignment model: a weighted DAG is built over the MEMs, with weights reflecting MEM collinearity in the read and the graph, and the best alignments are pulled out with a kind of max-sum dynamic programming algorithm.
3. The chunk alignments are then unified with a similar kind of model (call it the alignment chain DAG), applied to whole alignment chunks rather than MEMs. This yields a series of alignments that string together specific mappings for a subset of the chunks of the original read.
4. Finally, unaligned bits in this chained alignment are "patched" by a banded alignment of the unaligned portions into the subgraphs downstream or upstream of the nearest mapped chunks in the read.
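To make the chunking concrete, here is a minimal bash sketch of that first step only, not vg's actual code; the width and stride below are hypothetical stand-ins for the -w (chunk width) and -O (overlap) parameters discussed later:

W=256; O=32                        # hypothetical chunk width and overlap
STRIDE=$((W - O))                  # consecutive chunks overlap by $O bases
SEQ=$(sed -n 2p read.fq)           # sequence line of a one-record FASTQ
LEN=${#SEQ}
for ((i = 0; i < LEN; i += STRIDE)); do
    # each window would be multimapped independently in step 2
    printf 'chunk @%d: %s\n' "$i" "${SEQ:i:W}"
done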

As you've discovered, this approach is complicated and tricky to get right, and not at all as elegant as an exact alignment algorithm. But it's functional, and I applied it to a number of problems during my PhD, including progressive graph construction and a long-read-to-complex-graph alignment experiment in the vg NBT paper. One key motivation for the design is that it allows alignment of long reads to graphs of arbitrary complexity, and in principle it can allow for the detection of any kind of structural variation (although the chunk size heuristic interacts with the scale of SV that you can detect).

One problem that took me a long time to appreciate is that this model will only work properly if the chunks inside the read are aligned with something like a global alignment algorithm. We want the chunk alignments to be full length, so that the chaining will work and allow the full-length alignment to be found as a path through the alignment chain DAG. This issue isn't so bad when you're progressively aligning high-quality chromosomes with 1-2% maximum divergence. It becomes much more noticeable at the 10-15% error rates of single molecule reads. Using a local alignment algorithm, which was the default in v1.9, results in many failed or heavily softclipped alignments in this case (the -L full length bonus used below is meant to address exactly this).

evaluating changes in vg long read alignment on L2 and L3

To demonstrate this, I've realigned L2.fastq and L3.fastq against LRC.vg.

The setup:

vg version v1.16

# graph and reads from the PaSGAL evaluation
wget https://alurulab.cc.gatech.edu/sites/all/projectFiles/PaSGAL_Chirag/LRC.vg
wget https://alurulab.cc.gatech.edu/sites/all/projectFiles/PaSGAL_Chirag/L2.fastq
wget https://alurulab.cc.gatech.edu/sites/all/projectFiles/PaSGAL_Chirag/L3.fastq
# xg index over the whole graph
vg index -x LRC.xg LRC.vg
# prune complex regions so GCSA2 indexing is tractable
vg prune LRC.vg >LRC.prune.vg
vg index -g LRC.gcsa LRC.prune.vg
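As a quick sanity check before mapping (not part of the original recipe), you can confirm the graph loaded and both indexes were built:

# optional: report the graph's node and edge counts
vg stats -z LRC.vg
# confirm both indexes exist
ls -lh LRC.xg LRC.gcsa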

I've aligned the read sets in "default" mode, which uses the same scoring parameters as minimap2 (-u 2 -q 1 -z 2 -o 2 -y 1) and should be fairly similar to how you ran it in the paper. The second mode is "long", which is equivalent to -u 2 -q 1 -z 2 -o 2 -y 1 -L 63 -w 128 -O 32.

Here, the -L 63 provides a full length bonus that encourages end-to-end alignment: the bonus is awarded when an alignment reaches the ends of the sequence, so it can outweigh the softclipping that a purely local alignment prefers on noisy read tails. -w 128 and -O 32 change the chunk width and overlap parameters to be more sensitive. Any other changes from what you evaluated are bugfixes to the alignment code that are shared between both modes.
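For concreteness, here are the two invocations side by side, given the equivalence stated above (-d LRC picks up the LRC.xg/LRC.gcsa pair built earlier; GAM output goes to stdout):

# default mode, minimap2-like scoring
vg map -d LRC -f L2.fastq -u 2 -q 1 -z 2 -o 2 -y 1 >L2.default.gam
# long mode; equivalent to the default scoring plus -L 63 -w 128 -O 32
vg map -d LRC -f L2.fastq -m long >L2.long.gam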

Alignment and table generation:

# for L2: -j emits one JSON alignment per line; jq extracts name/identity/score,
# sed zeroes the null fields of unaligned reads, and awk labels each row by model
( echo model name identity score | tr ' ' '\t'
  vg map -f L2.fastq -d LRC -u 2 -q 1 -z 2 -o 2 -y 1 -j \
    | jq -cr '[.name, .identity, .score] | @tsv' | sed s/null/0/ \
    | awk '{ print "default", $0 }' | tr ' ' '\t'
  vg map -f L2.fastq -d LRC -m long -j \
    | jq -cr '[.name, .identity, .score] | @tsv' | sed s/null/0/ \
    | awk '{ print "long", $0 }' | tr ' ' '\t' ) >vg_compare_L2.samescore.tsv

# for L3
( echo model name identity score | tr ' ' '\t'
  vg map -f L3.fastq -d LRC -u 2 -q 1 -z 2 -o 2 -y 1 -j \
    | jq -cr '[.name, .identity, .score] | @tsv' | sed s/null/0/ \
    | awk '{ print "default", $0 }' | tr ' ' '\t'
  vg map -f L3.fastq -d LRC -m long -j \
    | jq -cr '[.name, .identity, .score] | @tsv' | sed s/null/0/ \
    | awk '{ print "long", $0 }' | tr ' ' '\t' ) >vg_compare_L3.samescore.tsv
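For a quick look at the distributions without plotting software, a rough text histogram can be pulled from either TSV with awk. This is just a sketch (the 0.05 bin width is an arbitrary choice); the figures below were rendered as proper plots:

# bin the identity column (field 3) per model and count
tail -n+2 vg_compare_L2.samescore.tsv \
  | awk -F'\t' '{ bin = int($3 / 0.05) * 0.05; count[$1 "\t" bin]++ }
      END { for (k in count) print k "\t" count[k] }' \
  | sort -k1,1 -k2,2n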

Resulting in histograms of the identity and score metrics for L2:

[figure: vg_compare_L2.samescore score histogram]
[figure: vg_compare_L2.samescore identity histogram]

And L3:

[figure: vg_compare_L3.samescore score histogram]
[figure: vg_compare_L3.samescore identity histogram]

summary

I can't exactly replicate the result in your paper with this change, but I think it is very suggestive of the same pattern you found: for L2, vg-heuristic wasn't doing very well, and for L3 it was basically failing to get full-length mappings in many cases. To be certain it's the same as what you saw, I'll have to run the alignments with PaSGAL, which #1 blocks me from doing for the moment.

Please take this as you will. It's all really just to note that the heuristic alignment model in vg isn't fundamentally incapable of aligning these long reads. But you're totally correct: the parameterization in vg v1.9 certainly was, and I can confirm that on this simulated data. Thanks for bearing with me while I document this!

cjain7 commented May 29, 2019

Thanks @ekg for the detailed comment; this is good to know for us and, in general, for anyone who is a fan of the vg tool :)

We'll need some time to re-run the experiments on our test instances with the settings you recommend. Happy to share the results as soon as I have them.
