-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bismark Alignment Inquiry #654
Comments
Hi Antonio, Welcome to the world of methylation alignments! As a first comment, Accel Swift typically requires some specific trimming, namely
You can of course try to align to a close species at first to get some experience, and then perform it once more you've got a good reference sequence available. The issue that you have experienced already is that the more distantly related the species are, the lower the mapping efficiency will be. The default mapping parameters for Bismark are deliberately very strict, so increasing it to As another issue, you should really re-do the trimming first, this should also greatly enhance the mapping efficiency (while reducing errors). And finally, as soon as you have a reference sequence to align to, you can use Bismark, as it doesn't do anything with the annotations themselves. Once the annotation is at hand, you can rather use it as an additional layer for downstream analysis. |
Hmm, a 1% mapping efficiency is very low indeed. This typically happens either when you have the wrong library type (can you attache the R1/R2 FastQC base composition plots?), or when you align to a completely wrong genome (e.g. human/mouse), or possibly if there were some grave technical errors, e.g. in R2? |
Hi Felix! Thank you for responding. I'm attaching the fastqc report here for one pair of my samples: I am also trying to see if my sequences are contaminated just to make sure that my samples are really the species that I collected. Do you have any recommendations on what software that I can use for this task? I'm reading that Fastq screen can screen contaminants and can be used for bisulfite sequences. Regarding your last suggestion/comment, how do we assess if there is a grave technical error in our sequences? Thank you! |
Thanks for all these plots (they are from the 5' clipped samples, correct?). Overall they all look fine, there are no signs of technical or unexpected issues in the reads (e.g. in the quality, or tile graphs, or the N-count). My next best guess would be that the sample are simply not from the organism you think they are from. If you wanted you could send me small subset of reads (e.g. 200,000, untrimmed, fastq.gz compressed) via email, and I could take a quick look? |
Hi Felix! sorry for the late reply. Our lab had a field work for 3 weeks field work. May I ask what software would you suggest for me to be able to get a small subset of my sequences for the second fish species (fish2)? may I also ask for your email? is this email correct: [email protected]? Currently we were done with the alignment of the first fish species ( Fish1; the one that I described that the mapping efficiency jumped from 30% to 50% using the default parameters of Bismark (bismark Parameter: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)). Here is the result for report generation: I just have another question for Fish1: during alignment, bismark is saying that "Chromosomal sequence could not be extracted for..." as shown in the figure below. is this to be expected since we are using genome at the scaffold level? |
Yes, these warnings show up more frequently if you have short chromosomes or scaffolds. Typically the number is negligibly small, in your case ~0.05%. Just ignore. To interact with reads you can just use
|
Thanks for the sample files. The sample seems to have been processed using the Accel Swift kit, which introduces some hefty biases, especially at the start of Read 2: ![]() As a solution to this you just need to clip your reads from all ends, like so:
I am convinced after that the reads will align just fine. Let me know how you get on! |
Hi Felix! thanks again for accommodating my questions regarding the alignment of our samples. we had a major hiccup for our project since there was an issue about the identity of one of our target fish species that is why our wgbs data cannot align properly with the available genome in NCBI (with mapping efficiency of ~0.1%). regardless, our group is currently working on the other fish species that have a much better alignment (the one that you checked; representative alignment report below:) [specified options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Final Alignment reportSequence pairs analysed in total: 109077988 Number of sequence pairs with unique best (first) alignment came from the bowtie output: Number of alignments to (merely theoretical) complementary strands being rejected in total: 0 Final Cytosine Methylation ReportTotal number of C's analysed: 2555498641 Total methylated C's in CpG context: 185233029 Total unmethylated C's in CpG context: 55597280 C methylated in CpG context: 76.9% I also tried using less stringent alignment parameters (see representative report below:) [specified options: -q --score-min L,0,-0.6 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Final Alignment reportSequence pairs analysed in total: 109378293 Number of sequence pairs with unique best (first) alignment came from the bowtie output: Number of alignments to (merely theoretical) complementary strands being rejected in total: 0 Final Cytosine Methylation ReportTotal number of C's analysed: 3712028338 Total methylated C's in CpG context: 272265530 Total unmethylated C's in CpG context: 94436141 C methylated in CpG context: 74.2% we also had done deduplication using --score-min L,0,-0.2 results/alignment (see representative report below:) _bismark_bt2_pe.bam: 55919438 Total count of deduplicated leftover sequences: 48784544 (87.24% of total) as of right now, I want to do the next step of the pipeline which is methylation calling. however, my concern is that most of the papers that I've read says that I should use annotated genome during methylation calling however, our current available genome is still being annotated and what we have is only the assembled genome. is it ok to use the only assembled genome for methylation calling and for subsequent downstream analyses (DMRs/DMGs)? so far I tried to run methylation calling for one sample (see report below:) [script/commands that I used: bismark_methylation_extractor --comprehensive --bedGraph --gzip --scaffolds --cytosine_report --genome_folder] after I ran methylation calling, I got these output files: When I opened the [Sample]_bismark_bt2_pe.deduplicated.cytosine_context_summary.txt file, there was no reports that are in the txt files. however [Sample]_pe.deduplicated.M-bias.txt and [Sample]_bismark_bt2_pe.deduplicated_splitting_report.txt had reports (see representative report of [Sample]_bismark_bt2_pe.deduplicated_splitting_report.txt below:) [Sample]bismark_bt2_pe.deduplicated.bam Parameters used to extract methylation information: Processed 48784544 lines in total Final Cytosine Methylation ReportTotal number of C's analysed: 1891803435 Total methylated C's in CpG context: 136624254 Total C to T conversions in CpG context: 40920797 C methylated in CpG context: 77.0% is my approach correct?? I hope that you can again help me with this and thank you in advance for answering! |
Technically, the methylation already takes place during the alignment step. As the name suggests, the methylation extractor merely extracts the calls and puts them into different formats, which is described in more detail here: https://felixkrueger.github.io/Bismark/bismark/methylation_extraction/ One thing to mention here would be that everything up to and including the coverage comes from the original (or deduplicated) BAM files. Only the last step, generating a cytosine report, requires the genome to be specified again. This genome hast to be the same genome as the one used for the alignment step. Again, more details can be found at the link above. Does that make sense? |
Hi Felix,
I have a couple of questions regarding Bismark. I'm sorry if they are stupid questions; I am fairly new to WGBS data and Bismark analysis (or to bioinformatics in general).
We are researching non-model fish animals. Recently, their genomes have become available in the NCBI database, and their assembly level is Scaffold (Additionally, it is still unannotated). Concurrently, there is another project in our laboratory to sequence our target organisms. I, in particular, have been tasked with analyzing WGBS data from our target animal. I already have the raw WGBS reads (Library type: Accel Methyl-Seq DNA library), and I have already trimmed them using Trim Galore (--paired --fastqc). While we are waiting for the genome of our target samples from our own laboratory (we had a few hiccups with our genome sequencing, which is why our WGBS reads came first), can I use a different genome (that is already annotated) from another species to align my WGBS reads?
I tried aligning our methylome reads to the genome of D. rerio using the default parameters (-q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)), which resulted in a mapping efficiency of 0.1%. I also tried aligning my WGBS reads to the unannotated genome of our target species (the one I mentioned earlier that has been deposited to the NCBI database) using also default parameters (-q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)), and I only achieved 31% mapping efficiency. I am reading from your GitHub that I should use --score-min L,0,-0.6, which I will be trying. My main question is, is it alright to use the unannotated genome as my reference for Bismark alignment (I will also try aligning our methylome reads to the genome that was produced by our laboratory once it is finished)? And does Bismark only align methylome reads from the same species, or will my methylome reads be able to align to the genomes of closely related species?
Thank you
The text was updated successfully, but these errors were encountered: