Multi-mapping reads and low mapping efficiency #656
Comments
There is quite a lot going on in this list, not quite sure I can give you a good recommendation. I have a few comments:
I don't think I received any email with reads, just in case you wanted to re-send them? (I would also need to know which genome you are trying to align to.)
Hi Felix,

> 1. the TruSeq kit should be directional, correct? Could you attach the base composition plot of FastQC (of the untrimmed reads) to take a look? If it is directional, you should not use it non-directional

I agree with you that the libraries are directional and I should stick to their analysis in that mode. I tried running the non-directional mode just to be sure that it was not creating a lot of difference in mapping efficiency. I am attaching the base composition plot of the untrimmed reads for your reference.

> 2. if the kit is directional, you shouldn't get any alignments in --pbat mode. Your number aligned reads all look fairly similar?

Regarding the PBAT run, instead of taking the reads as paired-end, I treated them as single reads and ran in the PBAT mode trying something similar to what we do in the Dirty Harry approach.

I have also sent you an email with the sample reads with information about the reference genome. Thanks again!
Thanks for providing additional details. The base composition plot is quite informative, as it shows:
Looking at the alignment report you seem to have a split of roughly the following alignments:
So overall, you've got >90% of reads originating from your plant, which is good news. I am afraid there isn't much we can do about the multi-mapping of reads. Either the reference genome you are working with is still a bit crude, with similar multi-mapping scaffolds still included, or it really is that repetitive...

I just discovered your email in my Spam folder, but I don't think there is a lot else I could contribute currently, let me know your thoughts.
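For reference, the bookkeeping behind a split like this can be sketched in a few lines of Python. This is only an illustration: the function name `mapping_split` and all of the counts are made up and are not taken from the alignment report in this issue. The point is that the overall mapped fraction is the sum of the uniquely aligned and the multi-mapping fractions, so a high overall rate can coexist with a low unique mapping efficiency.

```python
def mapping_split(total, unique, ambiguous):
    """Return the percentage of reads (or read pairs) in each alignment class.

    total     -- sequences analysed
    unique    -- sequences with a unique best alignment
    ambiguous -- sequences that aligned but not uniquely (multi-mapping)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    unmapped = total - unique - ambiguous
    if unmapped < 0:
        raise ValueError("unique + ambiguous exceeds total")

    def pct(n):
        return 100.0 * n / total

    return {
        "unique": pct(unique),
        "multi_mapping": pct(ambiguous),
        "mapped_overall": pct(unique + ambiguous),
        "unmapped": pct(unmapped),
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
split = mapping_split(total=1_000_000, unique=420_000, ambiguous=430_000)
for name, value in split.items():
    print(f"{name}: {value:.1f}%")
```

With counts like these, the overall mapped fraction is high even though fewer than half of the sequences align uniquely.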
I am currently using Bismark to analyze DNA methylation data from root/nodule samples from a legume plant species and am facing issues regarding multi-mapping reads and low mapping efficiencies.
![Screenshot 2024-02-19 at 1 56 59 pm](https://private-user-images.githubusercontent.com/120827418/306234547-4f309e55-222a-4d64-9795-02b90692f5f1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE4NTUzMDUsIm5iZiI6MTcyMTg1NTAwNSwicGF0aCI6Ii8xMjA4Mjc0MTgvMzA2MjM0NTQ3LTRmMzA5ZTU1LTIyMmEtNGQ2NC05Nzk1LTAyYjkwNjkyZjVmMS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjQwNzI0JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI0MDcyNFQyMTAzMjVaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT1jNWQxYjQwYTExMDNjYmE1NGZhNmYyOTk1ZmEwNWY0OTdiNzc0Yzc1MjAxOGI4NDZlOTcyZjM4YTlmNTMzYzM5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCZhY3Rvcl9pZD0wJmtleV9pZD0wJnJlcG9faWQ9MCJ9.ZpUS9cXQfCJDxt64y3Va9wV-8tg6ZIZip0VRZJbfAdg)
My PE100 libraries were prepared using the TruSeq DNA Methylation kit and are therefore directional. The spike-in control is lambda phage. FastQC of the samples showed that a few bases at the 5' end were causing trouble. I have tried running Bismark in various permutations using both clipped (9N and 6N trimmed from the 5' and 3' ends, respectively) and non-clipped FASTQ files. However, in none of the attempts did my mapping efficiency go beyond ~40-45%. The conditions I tried for the Bismark runs, along with their mapping efficiencies, are shown in the screenshot above.
Although most reads map to the reference genome (~79-90%), the unique mapping efficiencies are low: a large part of the sequences are multi-mapping. Moreover, unique mapping goes down when clipped FASTQ files are used (see Condition 3 vs. 4). Even local alignment did not produce substantially better mapping statistics. Among these conditions, I get the highest mapping percentage only if I relax the stringency to a scoremin of 0.6. Could you please recommend changes or conditions to improve the unique mapping efficiency? Any insights or considerations for resolving this low efficiency would be highly appreciated.
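To make the relaxed-stringency condition concrete, here is a minimal Python sketch that only assembles (and prints) a paired-end, directional Bismark command line. It rests on several assumptions: "scoremin of 0.6" is taken to mean `--score_min L,0,-0.6`, the genome folder and FASTQ file names are hypothetical placeholders, and `--ambiguous` is added here only so that multi-mapping reads are written out for inspection; it was not part of the conditions listed above.

```python
import shlex


def bismark_command(genome_dir, read1, read2,
                    score_min="L,0,-0.6", keep_ambiguous=True):
    """Build the argument list for a paired-end, directional Bismark run.

    No --non_directional or --pbat flag is added, so Bismark runs in its
    default directional mode.
    """
    cmd = ["bismark", "--genome", genome_dir, "--score_min", score_min]
    if keep_ambiguous:
        # Assumption: write reads that do not align uniquely to a separate file.
        cmd.append("--ambiguous")
    cmd += ["-1", read1, "-2", read2]
    return cmd


# Hypothetical paths; replace with the real genome folder and FASTQ files.
cmd = bismark_command("genome/", "sample_R1.fastq.gz", "sample_R2.fastq.gz")
print(shlex.join(cmd))
```

The list returned by `bismark_command` could be passed to `subprocess.run(cmd, check=True)` on a machine where Bismark is installed; the sketch deliberately stops at printing the command.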
I have also forwarded you an email related to this with some sample sequences in case you would like to take a look at the data.
Thanks in advance.