-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unable to encode very short reference sequences. #33
Comments
Thank you for the detailed documentation of the problem with encoding short reference sequences. The explanation for the 84-base minimum is simple: the nongapped seeding algorithm uses an 84-bit spaced seed. It would be possible to retool Arioc to allow some flexibility in the spaced seed, but that would require a lot of testing and validation before it could be used. But that limitation should not apply in the situation where you aggregate your short reference sequences in a single FASTA file. The encoded binary representation puts padding between the sequences (analogous to your experiment with adding Ns to short reference base sequences) and handles both sequence naming (extracted from FASTA definition lines) and sequence topology (linear or circular). From the AriocE log, it looks like there were over a billion seeds computed -- but none of them made it into the hash table! I cannot see the reason for this from the log, but I suspect this FASTA file embodies an edge case that's never been tested, namely, a set of reference sequences all of which are too short to be seeded with 84-bit seeds. If so, AriocE needs to handle this more gracefully than simply failing with May I ask: what is the distribution of sequence length in the sequences in the FASTA file? |
Edit: Thank you for your assistance ahead of time! The sequences in this case are all 150 bp reads from an Illumina nextseq 1000. My particular use case requires to align 400+ million reads that are all exactly the same in most regions, with an area of variation (That's the regions with N's) that we are trying to extract. We then count how many times we see a sequence to conclude an enrichment score. On that note as an aside, I noticed that the output says that Arioc discards duplicate reads, would that mean that by default Arioc would not be suited for the task? I was thinking if thats the case I could just edit the source files locally to just skip that part! We do have reference sequences that can go up to 1000 bp (rarely) and reads that can go up to 500 bp, but I just happened to choose this test case first because it was pretty simple. |
Hi, Jarrett -- My question was, what is the distribution of sequence length in the sequences in the FASTA file, that is, the distribution of sequence length of the reference sequences in You've specified a wildcard SN ( I can't yet tell from what I can see so far, but something is clearly going haywire when AriocE tries to encode your reference sequences for nongapped alignment. If nothing else, I want to fix AriocE's handling of this situation so that emits an informative error message and not just garble! If you prefer, share your input FASTA file somewhere, or send it to me via some file transfer service, and I'll put AriocE under a debugger and see where things are going sideways. As for the question about "duplicate reads": I'm not sure what you're referring to. Certainly the Arioc aligners (AriocU and AriocP) do not check for duplicate reads. There are dedicated software tools for that. Could you please tell me more specifically what you're referring to? Thanks! |
Oh! I'm sorry, I named the top "file" incorrectly, instead of reference_sequences.fa is should be reference_templates.fa . Those two (Mutant and WT) are the actual reference sequences I tried to encode. I really was only using two in this instance. I'll write them below again for quick reference.
I did notice that it was specifically that repeated "N"s that caused the issue with the nongapped encoding. If I make sure that there are 84 non-degenerate bases on either side of the N's it works. If I take the N's away it works. But if there are less than 84 non-degenerate bases on both sides of the N's, then we run into the issue. To get it to run (I wanted to test the speed regardless of this issue) I simply added padding non-degenerate bases to the reference sequences to get around this. Though this understandably caused issues with alignment so I figured I'd have to open an issue here to dig deeper.
I'm referring to the line in the output of for alignment that says "duplicate mappings (unreported)". I think I was imprecise in my words when I said "duplicate reads" instead of "duplicate mappings".
|
I think I see the problem. Given a reference sequence where there is no contiguous set of adjacent non-N bases that is at least as long as the nongapped seed length, AriocE is (correctly) reporting that no seeds were found but (incorrectly) failing to emit a usable empty list of seed locations for that reference sequence. (As you have observed, reads can still be aligned to such sequences using gapped alignment seeds, which are much shorter.) I will dig into this and fix it. As for the duplicate mappings: this is a small piece of information that can be helpful in performance tuning. When the aligner (AriocU or AriocP) is configured to report secondary mappings (i.e., the The aligner does not report the same mapping twice for the same read, but it does count them. When the aligner counts a lot of these duplicate mappings, you might increase speed without affecting sensitivity by decreasing |
Ah! Thank you for the clarification on the duplicated mappings. I will continue on my integration of Arioc and let you know of any other edge cases I run into! |
I have created an updated Arioc v1.52 as a preliminary release. It contains a fix for the problem you reported. |
I can confirm I am able to encode my short references without padding now! That's awesome thanks! |
That is good news. Thank you for reporting the problem and giving us a chance to resolve it. |
Hello!
I've been trying to Align against some small reference sequences but run into the error "maximum subunit ID in nongapped J table: 127; in gapped J table: 1". As far as I can tell, this is due to the nongapped encoding requiring 84 consecutive NT's without any N's. When I pad the reads with ACTGs such that there is 84 NT's on either side of the string of N's, it works. Is there a way to circumvent this without padding the reference sequences?
It's very possible what I need out of an aligner doesn't mesh with Arioc, but the potential speed up was too much to ignore!
Edit: Changed name of file to avoid confusion
reference_templates.fa
Encode_gapped.cfg
Encode_nongapped.cfg
Encode_Reads.cfg
Align_upaired.cfg
Align log output:
Non Gapped Encoding log output
The text was updated successfully, but these errors were encountered: