Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about assigning 'unassignable' reads #79

Open
RobertBaird opened this issue Mar 5, 2024 · 2 comments
Open

Question about assigning 'unassignable' reads #79

RobertBaird opened this issue Mar 5, 2024 · 2 comments

Comments

@RobertBaird
Copy link

Hi, I have a question about how reads are assigned as conflicting, unassignable etc.

I've been trying to use SNPsplit on noisy PacBio long reads to help with haplotype-resolved assembly (which I know is a little optimistic), and I'm losing a lot of reads. Here's the allele-tagging report:

Processed 7590179 read alignments in total
Reads were unaligned and hence skipped: 264205 (3.48%)
5789031 reads were unassignable (76.27%)
111956 reads were specific for genome 1 (1.48%)
69824 reads were specific for genome 2 (0.92%)
155723 reads did not contain one of the expected bases at known SNP positions (2.05%)
611385 contained conflicting allele-specific SNPs (8.05%)

I'm expecting around 5% of reads to be assigned to each genome, so it looks like I'm losing a lot due to reads either not containing an expected base at known SNP positions (presumably because of single-base or insertion sequencing errors), or reads containing conflicting SNPs (presumably mainly due to sequencing error deletions of one of the N-masked sites).

Is there a way to tune these features so that e.g. a read can still be assigned to genome 1 if it contains a genome 1-specific SNP, but has something other than the genome 2 SNP at another N-masked position (like an obvious sequencing error)?

Thanks!

@FelixKrueger
Copy link
Owner

Hmm, this is a tricky question. Frankly, I don't think SNPsplit is suited for this kind of scenario (long, noisy reads) and there is little if any functionality currently to allow fine tuning in it's current form.

The majority of reads seem to get assigned to the 'conflicting' category. Maybe it is worth visualising this by using the --conflicting option, and then looking at the reads in IGV? Maybe this can yield some clues if some additional post-processing could be applied?
Apologies if isn't very helpful, but the intended use-case really is short reads in combination with fully characterised mouse strains. It feels like the long read field should have some better solutions for allele-specific assignment?

@RobertBaird
Copy link
Author

Hey Felix, thanks for the response. It seems like I'm losing most of my reads to deletions of the N-masked sites, which arise from sequencing errors. I agree that it's definitely not suited to long noisy reads - maybe HiFi reads would work though. Unfortunately there doesn't seem to be much out there for allele-specific sorting of reads.

Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants