Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bismark Alignment Inquiry #654

Open
KeemAntonio opened this issue Feb 16, 2024 · 10 comments
Open

Bismark Alignment Inquiry #654

KeemAntonio opened this issue Feb 16, 2024 · 10 comments

Comments

@KeemAntonio
Copy link

Hi Felix,

I have a couple of questions regarding Bismark. I'm sorry if they are stupid questions; I am fairly new to WGBS data and Bismark analysis (or to bioinformatics in general).

We are researching non-model fish animals. Recently, their genomes have become available in the NCBI database, and their assembly level is Scaffold (Additionally, it is still unannotated). Concurrently, there is another project in our laboratory to sequence our target organisms. I, in particular, have been tasked with analyzing WGBS data from our target animal. I already have the raw WGBS reads (Library type: Accel Methyl-Seq DNA library), and I have already trimmed them using Trim Galore (--paired --fastqc). While we are waiting for the genome of our target samples from our own laboratory (we had a few hiccups with our genome sequencing, which is why our WGBS reads came first), can I use a different genome (that is already annotated) from another species to align my WGBS reads?

I tried aligning our methylome reads to the genome of D. rerio using the default parameters (-q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)), which resulted in a mapping efficiency of 0.1%. I also tried aligning my WGBS reads to the unannotated genome of our target species (the one I mentioned earlier that has been deposited to the NCBI database) using also default parameters (-q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)), and I only achieved 31% mapping efficiency. I am reading from your GitHub that I should use --score-min L,0,-0.6, which I will be trying. My main question is, is it alright to use the unannotated genome as my reference for Bismark alignment (I will also try aligning our methylome reads to the genome that was produced by our laboratory once it is finished)? And does Bismark only align methylome reads from the same species, or will my methylome reads be able to align to the genomes of closely related species?

Thank you

@FelixKrueger
Copy link
Owner

Hi Antonio,

Welcome to the world of methylation alignments! As a first comment, Accel Swift typically requires some specific trimming, namely

--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10

You can of course try to align to a close species at first to get some experience, and then perform it once more you've got a good reference sequence available. The issue that you have experienced already is that the more distantly related the species are, the lower the mapping efficiency will be. The default mapping parameters for Bismark are deliberately very strict, so increasing it to --score-min L,0,-0.6 will allow 3 times as many errors as the default mode.

As another issue, you should really re-do the trimming first, this should also greatly enhance the mapping efficiency (while reducing errors).

And finally, as soon as you have a reference sequence to align to, you can use Bismark, as it doesn't do anything with the annotations themselves. Once the annotation is at hand, you can rather use it as an additional layer for downstream analysis.

@KeemAntonio
Copy link
Author

Hi Felix!

Thank you for your prompt response! we have re-done trimming using <--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10> in TrimGalore! we have seen significant improvement as you can see in per base sequence
new trim
we have also seen a significant increase on the mapping efficiency of one of our samples on one of our target organism (I am analyzing two different species of non-model fish) with a jump from ~30% to 51% using the newly trimmed reads (bismark Parameter: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)). we have also tried aligning using --score-min L,0,-0.4 and got an increase from our old trimmed sequences (57.7%) comparing it to new trimmed sequences using the suggested parameter (66.9%). Lastly, we also aligned using --score-min L,0,-0.6 and got an increase from 72.9% to 77.0%. Thank you for this!

I just have a few more questions. since I am working on another species of fish we tried doing the same steps/parameters for trimming (--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10) but when we aligned it to its assembled genome that we got from NCBI database (note that we are still in the midst of our own genome assembly) we only got a 1.2% mapping efficiency using this parameter (-q --score-min L,0,-0.6 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)) here is the final report:

Final Alignment report

Sequence pairs analysed in total: 94422002
Number of paired-end alignments with a unique best hit: 1156101
Mapping efficiency: 1.2%
Sequence pairs with no alignments under any condition: 93197502
Sequence pairs did not map uniquely: 68399
Sequence pairs which were discarded because genomic sequence could not be extracted: 4698

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 573176 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 578227 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 56000878

Total methylated C's in CpG context: 3779225
Total methylated C's in CHG context: 236041
Total methylated C's in CHH context: 537007
Total methylated C's in Unknown context: 15090

Total unmethylated C's in CpG context: 3543149
Total unmethylated C's in CHG context: 11928857
Total unmethylated C's in CHH context: 35976599
Total unmethylated C's in Unknown context: 292947

C methylated in CpG context: 51.6%
C methylated in CHG context: 1.9%
C methylated in CHH context: 1.5%
C methylated in Unknown context (CN or CHN): 4.9%

I have a feeling that the available genome that I am using in this species is highly fragmented

the assembly statistics is:

Genome size | 709.4 Mb
Total ungapped length | 577.7 Mb
Number of scaffolds | 191,173
Scaffold N50 | 11.4 Mb
Scaffold L50 | 22
Number of contigs | 281,720
Contig N50 | 2.6 kb
Contig L50 | 67,375
GC percent | 44.5
Genome coverage | 65.0x
Assembly level | Scaffold

do you have any recommendations on how to proceed on this problem? Thank you so much!

@FelixKrueger
Copy link
Owner

Hmm, a 1% mapping efficiency is very low indeed. This typically happens either when you have the wrong library type (can you attache the R1/R2 FastQC base composition plots?), or when you align to a completely wrong genome (e.g. human/mouse), or possibly if there were some grave technical errors, e.g. in R2?

@KeemAntonio
Copy link
Author

Hi Felix! Thank you for responding. I'm attaching the fastqc report here for one pair of my samples:

R1:
per_sequence_gc_content
per_sequence_quality
per_tile_quality
sequence_length_distribution
adapter_content
duplication_levels
per_base_n_content
per_base_quality
per_base_sequence_content

R2:
adapter_content
duplication_levels
per_base_n_content
per_base_quality
per_base_sequence_content
per_sequence_gc_content
per_sequence_quality
per_tile_quality
sequence_length_distribution

I am also trying to see if my sequences are contaminated just to make sure that my samples are really the species that I collected. Do you have any recommendations on what software that I can use for this task? I'm reading that Fastq screen can screen contaminants and can be used for bisulfite sequences.

Regarding your last suggestion/comment, how do we assess if there is a grave technical error in our sequences?

Thank you!

@FelixKrueger
Copy link
Owner

Thanks for all these plots (they are from the 5' clipped samples, correct?). Overall they all look fine, there are no signs of technical or unexpected issues in the reads (e.g. in the quality, or tile graphs, or the N-count).

My next best guess would be that the sample are simply not from the organism you think they are from. If you wanted you could send me small subset of reads (e.g. 200,000, untrimmed, fastq.gz compressed) via email, and I could take a quick look?

@KeemAntonio
Copy link
Author

Hi Felix! sorry for the late reply. Our lab had a field work for 3 weeks field work. May I ask what software would you suggest for me to be able to get a small subset of my sequences for the second fish species (fish2)? may I also ask for your email? is this email correct: [email protected]?

Currently we were done with the alignment of the first fish species ( Fish1; the one that I described that the mapping efficiency jumped from 30% to 50% using the default parameters of Bismark (bismark Parameter: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)). Here is the result for report generation:

newplot (6)
newplot (7)
Bismark M-bias Read 1
Bismark M-bias Read 2
newplot (3)
newplot (4)
newplot (5)

I just have another question for Fish1:

during alignment, bismark is saying that "Chromosomal sequence could not be extracted for..." as shown in the figure below. is this to be expected since we are using genome at the scaffold level?
image

@FelixKrueger
Copy link
Owner

Yes, these warnings show up more frequently if you have short chromosomes or scaffolds. Typically the number is negligibly small, in your case ~0.05%. Just ignore.

To interact with reads you can just use gzip/gunzip, e.g. like so:

gunzip -c inputfile.fastq.gz | head -800000 | gzip - > output_200K.fastq.gz

@FelixKrueger
Copy link
Owner

Thanks for the sample files. The sample seems to have been processed using the Accel Swift kit, which introduces some hefty biases, especially at the start of Read 2:

Screenshot 2024-03-25 at 13 53 49

As a solution to this you just need to clip your reads from all ends, like so:

trim_galore --clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10

I am convinced after that the reads will align just fine. Let me know how you get on!

@KeemAntonio
Copy link
Author

Hi Felix! thanks again for accommodating my questions regarding the alignment of our samples. we had a major hiccup for our project since there was an issue about the identity of one of our target fish species that is why our wgbs data cannot align properly with the available genome in NCBI (with mapping efficiency of ~0.1%). regardless, our group is currently working on the other fish species that have a much better alignment (the one that you checked; representative alignment report below:)

[specified options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500
Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)]

Final Alignment report

Sequence pairs analysed in total: 109077988
Number of paired-end alignments with a unique best hit: 55975773
Mapping efficiency: 51.3%
Sequence pairs with no alignments under any condition: 52764057
Sequence pairs did not map uniquely: 338158
Sequence pairs which were discarded because genomic sequence could not be extracted: 56335

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 27777757 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 28141681 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 2555498641

Total methylated C's in CpG context: 185233029
Total methylated C's in CHG context: 2538723
Total methylated C's in CHH context: 8000082
Total methylated C's in Unknown context: 58299

Total unmethylated C's in CpG context: 55597280
Total unmethylated C's in CHG context: 561300346
Total unmethylated C's in CHH context: 1742829181
Total unmethylated C's in Unknown context: 1314730

C methylated in CpG context: 76.9%
C methylated in CHG context: 0.5%
C methylated in CHH context: 0.5%
C methylated in unknown context (CN or CHN): 4.2%

image

I also tried using less stringent alignment parameters (see representative report below:)

[specified options: -q --score-min L,0,-0.6 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500
Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)]

Final Alignment report

Sequence pairs analysed in total: 109378293
Number of paired-end alignments with a unique best hit: 83632232
Mapping efficiency: 76.5%
Sequence pairs with no alignments under any condition: 24581397
Sequence pairs did not map uniquely: 1164664
Sequence pairs which were discarded because genomic sequence could not be extracted: 545836

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 40287619 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 42798777 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 3712028338

Total methylated C's in CpG context: 272265530
Total methylated C's in CHG context: 4863260
Total methylated C's in CHH context: 15651691
Total methylated C's in Unknown context: 359661

Total unmethylated C's in CpG context: 94436141
Total unmethylated C's in CHG context: 804207651
Total unmethylated C's in CHH context: 2520604065
Total unmethylated C's in Unknown context: 8702049

C methylated in CpG context: 74.2%
C methylated in CHG context: 0.6%
C methylated in CHH context: 0.6%
C methylated in unknown context (CN or CHN): 4.0%

we also had done deduplication using --score-min L,0,-0.2 results/alignment (see representative report below:)

_bismark_bt2_pe.bam: 55919438
Total number duplicated alignments removed: 7134894 (12.76%)
Duplicated alignments were found at: 6149664 different position(s)

Total count of deduplicated leftover sequences: 48784544 (87.24% of total)

as of right now, I want to do the next step of the pipeline which is methylation calling. however, my concern is that most of the papers that I've read says that I should use annotated genome during methylation calling however, our current available genome is still being annotated and what we have is only the assembled genome. is it ok to use the only assembled genome for methylation calling and for subsequent downstream analyses (DMRs/DMGs)? so far I tried to run methylation calling for one sample (see report below:)

[script/commands that I used: bismark_methylation_extractor --comprehensive --bedGraph --gzip --scaffolds --cytosine_report --genome_folder]

after I ran methylation calling, I got these output files:

image
image

When I opened the [Sample]_bismark_bt2_pe.deduplicated.cytosine_context_summary.txt file, there was no reports that are in the txt files. however [Sample]_pe.deduplicated.M-bias.txt and [Sample]_bismark_bt2_pe.deduplicated_splitting_report.txt had reports (see representative report of [Sample]_bismark_bt2_pe.deduplicated_splitting_report.txt below:)

[Sample]bismark_bt2_pe.deduplicated.bam

Parameters used to extract methylation information:
Bismark Extractor Version: v0.23.1
Bismark result file: paired-end (SAM format)
Output specified: comprehensive
No overlapping methylation calls specified

Processed 48784544 lines in total
Total number of methylation call strings processed: 97569088

Final Cytosine Methylation Report

Total number of C's analysed: 1891803435

Total methylated C's in CpG context: 136624254
Total methylated C's in CHG context: 1889811
Total methylated C's in CHH context: 5987407

Total C to T conversions in CpG context: 40920797
Total C to T conversions in CHG context: 415950132
Total C to T conversions in CHH context: 1290431034

C methylated in CpG context: 77.0%
C methylated in CHG context: 0.5%
C methylated in CHH context: 0.5%

is my approach correct??

I hope that you can again help me with this and thank you in advance for answering!

@FelixKrueger
Copy link
Owner

Technically, the methylation already takes place during the alignment step. As the name suggests, the methylation extractor merely extracts the calls and puts them into different formats, which is described in more detail here: https://felixkrueger.github.io/Bismark/bismark/methylation_extraction/

One thing to mention here would be that everything up to and including the coverage comes from the original (or deduplicated) BAM files. Only the last step, generating a cytosine report, requires the genome to be specified again. This genome hast to be the same genome as the one used for the alignment step. Again, more details can be found at the link above. Does that make sense?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants