Comparison between modkit and modbam2bed. #14

Open
qiangfu-eu opened this issue May 9, 2023 · 31 comments

qiangfu-eu commented May 9, 2023

Following the discussion in #2, I ran some tests on two of our samples. However, I observe different results between the two tools. Note that, for simplicity,

  • the basecalling is done with the 5mC model only, hence no 5hmC is reported.
  • the methylation analysis focuses on CpG sites only, and the results of both strands are aggregated.
  • the same cutoff (estimated by modkit) is used for both methods.

On the pUC19 control sample, both methods detect a comparable methylation ratio, but the count statistics are very different: there is a ~50% difference in the number of valid calls, and modbam2bed reports more filtered calls and more no-calls.

For modbam2bed, I ran the analysis and summary as follows. Note that the 'cpg.acc' bed file is used, as it is the one that contains the aggregated results.

> modbam2bed -m 5mC -t 32 -e --cpg --aggregate -f 0.6074219 neb_puc19.fa PAK67493_barcode19_5mC_sorted.bam > modbam2bed.puc19.CpG.pileup.bed 2>modbam2bed.puc19.CpG.pileup.log
> awk '$4!="nan" {can+=$12; mod+=$13; oth+=$16; total+=$10; filter+=$14; nocall+=$15} END{valid=can+mod+oth; print (can) " ("(can/valid)") CpG canonical\n" (mod) " (" (mod/valid) ") 5mCpG modified\n" (oth) " (" (oth/valid) ") 5hmCpG modified\n" (valid) " valid\n"(filter) " filtered, NA diff nt, NA del, " (nocall) " nocall, " (total) " total."}' modbam2bed.puc19.CpG.pileup.cpg.acc.bed
229649 (0.128886) CpG canonical
1552155 (0.871114) 5mCpG modified
0 (0) 5hmCpG modified
1781804 valid
151311 filtered, NA diff nt, NA del, 35149 nocall, 1986678 total.

For modkit, I ran the analysis and summary as follows.

> modkit pileup -t 3 --cpg --ref neb_puc19.fa --combine-strands --filter-threshold 0.6074219 --log-filepath puc19.CpG.pileup.log PAK67493_barcode19_5mC_sorted.bam puc19.CpG.pileup.bed
> awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10; del+=$15; filter+=$16; diff+=$17; nocall+=$18} END{print (can) " ("(can/valid)") CpG canonical\n" (mod) " (" (mod/valid) ") 5mCpG modified\n" (oth) " (" (oth/valid) ") 5hmCpG modified\n" (valid) " total valid \n" (filter) " filtered, " (diff) " diff nt, " (del) " del, " (nocall) " nocall."}' puc19.CpG.pileup.bed
157106 (0.135189) CpG canonical
1005016 (0.864811) 5mCpG modified
0 (0) 5hmCpG modified
1162122 total valid
100066 filtered, 5921 diff nt, 5647 del, 15012 nocall.

@qiangfu-eu

Next we tested one human sample; I restricted the analysis to chr1 to save time. Here, however, there is a ~6% difference between the methylation levels estimated by the two methods, with modkit consistently estimating a lower level of methylation. The number of valid calls is similar, but the distribution between canonical calls and 5mC calls is clearly very different.

For modbam2bed, I ran the analysis and summary as follows. As before, the bed file with aggregated values is used for the statistics.

> modbam2bed -t 3 -m 5mC --cpg --aggregate --region 'chr1' -p 'PAI55644_EPI9.modkit.f066.CpG.pileup.chr1' -e GCA_000001405.15_GRCh38_no_alt_analysis_set.fna PAI55644_EPI9_Sotos_dorado_sorted.bam 1>PAI55644_EPI9.5mC.CpG.pileup.chr1.bed 2>run_5mC_CpG.pileup.chr1.log
> awk '$4!="nan" {can+=$12; mod+=$13; oth+=$16; total+=$10; filter+=$14; nocall+=$15} END{valid=can+mod+oth; print (can) " ("(can/valid)") CpG canonical\n" (mod) " (" (mod/valid) ") 5mCpG modified\n" (oth) " (" (oth/valid) ") 5hmCpG modified\n" (valid) " valid\n"(filter) " filtered, NA diff nt, NA del, " (nocall) " nocall, " (total) " total."}' PAI55644_EPI9.5mC.CpG.pileup.chr1.cpg.acc.bed
9896690 (0.139618) CpG canonical
60987562 (0.860382) 5mCpG modified
0 (0) 5hmCpG modified
70884252 valid
5942656 filtered, NA diff nt, NA del, 8965300 nocall, 103314258 total.

For modkit, the analysis and summary:

> modkit pileup  -t 3 --filter-threshold 0.66 --cpg --combine-strands --region 'chr1' --log-filepath run_modkit.f066.CpG.chr1.log PAI55644_EPI9_Sotos_dorado_sorted.bam PAI55644_EPI9.modkit.f066.CpG.pileup.chr1.bed --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
> awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10; del+=$15; filter+=$16; diff+=$17; nocall+=$18} END{print (can) " ("(can/valid)") CpG canonical\n" (mod) " (" (mod/valid) ") 5mCpG modified\n" (oth) " (" (oth/valid) ") 5hmCpG modified\n" (valid) " total valid \n" (filter) " filtered, " (diff) " diff nt, " (del) " del, " (nocall) " nocall."}' PAI55644_EPI9.modkit.f066.CpG.pileup.chr1.bed
13346630 (0.193721) CpG canonical
55549626 (0.806279) 5mCpG modified
0 (0) 5hmCpG modified
68896256 total valid
5320885 filtered, 3461627 diff nt, 9954251 del, 4154981 nocall.

@qiangfu-eu

I am not sure what is causing such differences. I would expect that, when using the same BAM file with the same settings, I should obtain similar results. What do you think could be the reason for these differences, and for the inconsistency between the two datasets?

Many thanks.

ArtRand commented May 9, 2023

@qiangfu-eu could you tell me which versions of modkit and modbam2bed you're using? Also, let me know if you see anything unusual in the modkit log. And if you could run wc -l <bedMethyl> on both outputs, that would help; I expect there will be fewer rows in the modkit output.

qiangfu-eu commented May 9, 2023

modbam2bed is v0.9.5 and modkit is v0.1.4. I have not yet updated to 0.1.5, as there are no real algorithmic changes. The log files are clean, without any issues.

Here are the logs for the pUC19 sample. Note that modbam2bed always gives this warning, although basecalling was done with the 5mC-only model and there is no N_altmod (all 0) or 'h' reported by modkit either. I also report wc -l at the end.

> cat modbam2bed.puc19.CpG.pileup.log
WARNING: You have specified either 5mC or 5hmC as a modified base.
         Oxford Nanopore Basecallers jointly call C, 5mC, and 5hmC. If you
         wish to combine calls of these bases into a single 'modified'
         count, please use the `--combine` option. The default behaviour
         is that calls of alternative modified bases are added to the
         alternatively-modified count.Analysing: 5-methylcytosine (5mC, C>m)
Fetched pUC19, 2686 2686
Processing: pUC19:0-2686
100 %
Total time: 5.470000s

> cat puc19.CpG.pileup.log
[src/logging.rs::54][2023-05-08 10:29:59][DEBUG] command line: ../../modkit_centos7_x86_64_v0.1.4/modkit pileup -t 3 --cpg --ref neb_puc19.fa --combine-strands --filter-threshold 0.6074219 --log-filepath puc19.CpG.pileup.log PAK67493_barcode19_5mC_sorted.bam puc19.CpG.pileup.bed
[src/commands.rs::723][2023-05-08 10:29:59][DEBUG] filtering output to only CpG motifs
[src/commands.rs::725][2023-05-08 10:29:59][DEBUG] combining + and - strand counts
[src/commands.rs::839][2023-05-08 10:30:05][INFO] Done, processed 173 rows.

wc -l modbam2bed.puc19.CpG.pileup.cpg.acc.bed puc19.CpG.pileup.bed
  173 modbam2bed.puc19.CpG.pileup.cpg.acc.bed
  173 puc19.CpG.pileup.bed

The logs for the human sample, and the row counts of the bed files.

> cat PAI55644_EPI9.5mC.CpG.pileup.chr1.cpg.log
WARNING: You have specified either 5mC or 5hmC as a modified base.
         Oxford Nanopore Basecallers jointly call C, 5mC, and 5hmC. If you
         wish to combine calls of these bases into a single 'modified'
         count, please use the `--combine` option. The default behaviour
         is that calls of alternative modified bases are added to the
         alternatively-modified count.Analysing: 5-methylcytosine (5mC, C>m)
Processing: chr1:0-248956422
100 %
Total time: 598.350000s

> cat PAI55644_EPI9.modkit.f066.CpG.pileup.chr1.bed.log
[src/logging.rs::54][2023-05-08 15:00:55][DEBUG] command line: ../modkit_centos7_x86_64_v0.1.4/modkit pileup -t 3 --filter-threshold 0.66 --cpg --combine-strands --region chr1 --log-filepath run_modkit.f066.CpG.chr1.log check_bam2bedmethyl/PAI55644_EPI9_Sotos_dorado_sorted.bam PAI55644_EPI9.modkit.f066.CpG.pileup.chr1.bed --ref /staging/leuven/stg_00019/research/nextflowPipeline/Resources/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
[src/commands.rs::620][2023-05-08 15:00:55][INFO] parsing region chr1
[src/commands.rs::723][2023-05-08 15:00:55][DEBUG] filtering output to only CpG motifs
[src/commands.rs::725][2023-05-08 15:00:55][DEBUG] combining + and - strand counts
[src/commands.rs::839][2023-05-08 15:10:26][INFO] Done, processed 2351885 rows.

> wc -l PAI55644_EPI9.5mC.CpG.pileup.chr1.cpg.acc.bed PAI55644_EPI9.modkit.f066.CpG.pileup.chr1.bed
  2371165 PAI55644_EPI9.5mC.CpG.pileup.chr1.cpg.acc.bed
  2351885 PAI55644_EPI9.modkit.f066.CpG.pileup.chr1.bed

I actually did not expect the row counts to be so different between the two methods on the human data, as I am focusing on CpG sites only and using the same BAM file.

qiangfu-eu commented May 9, 2023

I rechecked the modbam2bed-generated bed file, PAI55644_EPI9.5mC.CpG.pileup.chr1.cpg.acc.bed, and noticed there are 'nan' records reported. If I filter out those records, the row counts of the bed files become very similar, yet 318 differing rows still remain.
The count statistics remain almost identical to before after filtering the 'nan' records.

> grep -v 'nan' PAI55644_EPI9.5mC.CpG.pileup.chr1.cpg.acc.bed | wc -l
2351569

> grep -v 'nan' PAI55644_EPI9.5mC.CpG.pileup.chr1.cpg.acc.bed | awk '$4!="nan" {can+=$12; mod+=$13; oth+=$16; total+=$10; filter+=$14; nocall+=$15} END{valid=can+mod+oth; print (can) " ("(can/valid)") CpG canonical\n" (mod) " (" (mod/valid) ") 5mCpG modified\n" (oth) " (" (oth/valid) ") 5hmCpG modified\n" (valid) " valid\n"(filter) " filtered, NA diff nt, NA del, " (nocall) " nocall, " (total) " total."}' 
9896690 (0.139618) CpG canonical
60987562 (0.860382) 5mCpG modified
0 (0) 5hmCpG modified
70884252 valid
5941357 filtered, NA diff nt, NA del, 8718968 nocall, 102653022 total

ArtRand commented May 9, 2023

@qiangfu-eu one more quick check for me if you would: run the pUC19 BAM with modkit using --no-filtering. I want to try and understand the discrepancy in valid calls. The difference in the number of rows in the human sample is likely because modkit does not emit rows when all the calls are filtered (or there is zero coverage). Also, please run samtools flagstat on the BAM for me. Lastly, column 4 of the modbam2bed output should be the abbreviated name of the modified base (5mC), so maybe change your awk script to $4=="5mC".
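
For reference, one way to see which CpG sites are present in the modbam2bed output but absent from the modkit output would be something like this (just a sketch; the file names are the ones from your earlier commands, and the process substitutions assume bash):

# lines unique to the modbam2bed output, keyed on chrom:start
comm -23 \
  <(awk '{print $1":"$2}' PAI55644_EPI9.5mC.CpG.pileup.chr1.cpg.acc.bed | sort) \
  <(awk '{print $1":"$2}' PAI55644_EPI9.modkit.f066.CpG.pileup.chr1.bed | sort) \
  | head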

@qiangfu-eu

@ArtRand thanks for pointing out the errors in my awk command. Luckily, it does NOT influence the count statistics generated. I ran --no-filtering as requested, and the results are below.

> modkit pileup -t 3 --cpg --ref neb_puc19.fa --combine-strands --no-filtering --log-filepath puc19.CpG.pileup.nof.log PAK67493_barcode19_5mC_sorted.bam puc19.CpG.pileup.nof.bed

205854 (0.163093) CpG canonical
1056334 (0.836907) 5mCpG modified
0 (0) 5hmCpG modified
1262188 total valid
0 filtered, 5921 diff nt, 5647 del, 15012 nocall

There is about a 3% difference in the ratios, with more canonical calls.

Here is the samtools flagstat output for the pUC19 BAM. It is a mixture of lambda and pUC19, hence only a small fraction of reads are from pUC19.

2413864 + 0 in total (QC-passed reads + QC-failed reads)
2402913 + 0 primary
0 + 0 secondary
10951 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
150180 + 0 mapped (6.22% : N/A)
139229 + 0 primary mapped (5.79% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Maybe more interesting is samtools stats. Only around 50% of the mapped bases are properly mapped (based on CIGAR), as we run a short-read sequencing protocol and cannot trim reads after basecalling, so the reads contain a lot of adapter sequence. But those bases are soft-clipped, hence they should not contribute to the methylation calls.

This is not the case for the human dataset I tried above.

# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN      raw total sequences:    2402913 # excluding supplementary and secondary reads
SN      filtered sequences:     0
SN      sequences:      2402913
SN      is sorted:      1
SN      1st fragments:  2402913
SN      last fragments: 0
SN      reads mapped:   139229
SN      reads mapped and paired:        0       # paired-end technology bit set + both mates mapped
SN      reads unmapped: 2263684
SN      reads properly paired:  0       # proper-pair bit set
SN      reads paired:   0       # paired-end technology bit set
SN      reads duplicated:       0       # PCR or optical duplicate bit set
SN      reads MQ0:      0       # mapped and MQ=0
SN      reads QC failed:        0
SN      non-primary alignments: 0
SN      supplementary alignments:       10951
SN      total length:   777282472       # ignores clipping
SN      total first fragment length:    777282472       # ignores clipping
SN      total last fragment length:     0       # ignores clipping
SN      bases mapped:   56932068        # ignores clipping
SN      bases mapped (cigar):   32788998        # more accurate
SN      bases trimmed:  0
SN      bases duplicated:       0
SN      mismatches:     472538  # from NM fields
SN      error rate:     1.441148e-02    # mismatches / bases mapped (cigar)
SN      average length: 323
SN      average first fragment length:  323
SN      average last fragment length:   0
SN      maximum length: 32133
SN      maximum first fragment length:  32133
SN      maximum last fragment length:   0
SN      average quality:        25.3
SN      insert size average:    0.0
SN      insert size standard deviation: 0.0
SN      inward oriented pairs:  0
SN      outward oriented pairs: 0
SN      pairs with other orientation:   0

ArtRand commented May 10, 2023

Hello @qiangfu-eu

I've run some tests on my side directly comparing the two programs on larger modBAMs. So far I have not been able to reproduce the difference in valid calls that you're seeing. In fact, in my experiments modkit usually has slightly more valid calls, by about 0.1%. The differences in methylation counts and canonical counts end up at 0.7% and 0.1%, respectively. Similar to you, I'm invoking modkit and modbam2bed like this:

filter_threshold="0.6074219"
modkit pileup \
  ${bam} \
  ${modkit_results_combine} \
  --threads 20 --region chr20 \
  --filter-threshold ${filter_threshold} \
  --combine-strands \
  --cpg --ref ${reference} \
  --log-filepath modkit.log

modbam2bed -t 20 -m 5mC -e --aggregate --cpg \
  -p test_modbam2bed_06fthresh \
  -f ${filter_threshold} ${reference} ${bam} > ${mb2b_results_stranded} 2> modbam2bed.log

To validate, I'm also using awk:

# modbam2bed 
# n.b. that I'm using chr20 but modbam2bed will output all chroms so you need to filter on the first column
awk '($11!="nan") && ($1 == "chr20") {can+=$12; mod+=$13; oth+=$16; total+=$10; filter+=$14; nocall+=$15} END{valid=can+mod+oth; print (can) " ("(can/valid)") CpG canonical\n" (mod) " (" (mod/valid) ") 5mCpG modified\n" (oth) " (" (oth/valid) ") 5hmCpG modified\n" (valid) " valid\n"(filter) " filtered, NA diff nt, NA del, " (nocall) " nocall, " (total) " total."}' $fp

# modkit
awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10; del+=$15; filter+=$16; diff+=$17; nocall+=$18} END{print (can) " ("(can/valid)") CpG canonical\n" (mod) " (" (mod/valid) ") 5mCpG modified\n" (oth) " (" (oth/valid) ") 5hmCpG modified\n" (valid) " total valid \n" (filter) " filtered, " (diff) " diff nt, " (del) " del, " (nocall) " nocall."}' ${fp}

I did find a bug where the N-fail (a.k.a. N-filtered) calls will not be aggregated across strands when one of the strands has 0 N-valid. This doesn't affect the methylation statistics because those reference positions have 0 methylation and 0 canonical by definition. I'll be sure to fix it in this week's release.

If you would be willing to share the pUC19 data you have, I'd be keen to take a look and see where the difference is coming from.

qiangfu-eu commented May 10, 2023

Thanks @ArtRand for checking this in detail. I will see whether I can share the BAM file directly tomorrow. I did test on a human dataset, but as shown in the second post of this thread, there are also larger differences there.

@qiangfu-eu

Hello, @ArtRand. You can retrieve it at https://drive.google.com/file/d/1QGUJgNZYWj1h79SLOCzsYQhwVRHckx6B/view?usp=share_link. Let me know once you have downloaded it.

I just want to stress that, due to the short-fragment sequencing, only about half of the bases in the reads can be reliably aligned to the genome. That is what you will observe in this BAM file.

Many thanks.

ArtRand commented May 11, 2023

@qiangfu-eu Great. I got the BAM. I'm assuming from the header that the reference sequence you're using is the NEB pUC19 one (fasta), correct? Linked from this page.

@qiangfu-eu

Exactly.

ArtRand commented May 18, 2023

Hello @qiangfu-eu, I've spent some time working on this. I was able to reproduce the results you've shown on the pUC19 reads with modkit and with pysam. Both modkit and pysam use the pileup functionality in htslib. What I've discovered is that the behavior is unpredictable with reads that map to the ends of the plasmid sequence (presumably because they "wrap around") at the depth you have (~12,000x) in this BAM. I have implemented a version of modkit that does not rely on the pileup API, but it will take a little longer to get fully tested. It probably won't be ready for the next release, but likely the one following.
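
If you want to see how extreme the coverage is at the plasmid ends, something like the following should show it (a sketch only; it assumes a reasonably recent samtools where -d 0 removes the per-file depth cap, and uses the pUC19 contig name from your log):

samtools depth -a -d 0 -r pUC19 PAK67493_barcode19_5mC_sorted.bam | head -n 20   # raw depth at the start of the plasmid
samtools depth -a -d 0 -r pUC19 PAK67493_barcode19_5mC_sorted.bam | tail -n 20   # raw depth at the end of the plasmid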

qiangfu-eu commented May 22, 2023

Hi, @ArtRand, thanks for the efforts.

Indeed, the circular nature of the plasmid could cause issues for the tools. So you mean that modkit and modbam2bed handle this differently, and that this causes the observed inconsistency, since the same BAM is used for both tools? In the pUC19 sample we observe count differences between the tools, but the overall methylation ratio is rather similar.

However, as shown in the second post of this discussion, a bigger difference is observed between the tools when run on one of the human samples. There, although roughly similar numbers of valid calls are detected (70884252 for modbam2bed vs 68896256 for modkit), the methylation ratios differ by about 5 percentage points (canonical fraction 0.139618 for modbam2bed vs 0.193721 for modkit). This is probably a more important discrepancy between the tools that needs to be checked.

Thanks again for your help and time.

ArtRand commented May 24, 2023

@qiangfu-eu As for the difference you're seeing in the human data, it could be due to one or a few things. Firstly, you may somehow be seeing the same unpredictable pileup behavior. The logic around the threshold may be just different enough; you could try --no-filtering or, even better, use the filter threshold recommended by modkit for both tools. I've performed multiple tests on human data and the two tools produce very similar results with the same input (as shown above). I'll be sure to let you know when the new pileup backend is implemented (which solves the problem you're seeing with the pUC19 sample), as it may also help with your human sample.
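
For example, to take thresholding out of the comparison entirely on chr1, something along these lines should work (a sketch; the output and log file names are just placeholders):

modkit pileup -t 3 --cpg --combine-strands --region 'chr1' --no-filtering \
  --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
  --log-filepath run_modkit.nofilter.CpG.chr1.log \
  PAI55644_EPI9_Sotos_dorado_sorted.bam PAI55644_EPI9.modkit.nofilter.CpG.pileup.chr1.bed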

qiangfu-eu commented Jun 13, 2023

@ArtRand I tried the latest version, and the differences are still there. One extra difference that I had previously overlooked is that modbam2bed always identifies more calls than modkit: there are 1933115 vs 1262188 valid+filtered calls for modbam2bed and modkit, respectively. The same holds for the human dataset, 76826908 vs 74217141 on chr1. modbam2bed also always reports more no-calls. Since the same BAM file is used and the modification calls are already made, they should be the same; it is unclear to me why this happens.
Did you observe this on your datasets as well, or are the counts much closer there?

On all my datasets, even the human one, the estimated threshold is not very strict, mostly around 0.66, and I wonder whether that might also have some influence. Should one rather choose a stricter threshold and consider only the most reliable calls? The value modkit estimates on my data is not so strict. What is your suggestion in such a case, and what threshold do you observe on your human dataset?

ArtRand commented Jun 15, 2023

@qiangfu-eu
Sorry about the delay in getting back to you.

I haven't updated the pileup engine yet, so I wouldn't expect the results to change with the more recent releases. I'll let you know when those changes go in; it's on the roadmap, but it's a large refactor. I agree that the counts between the two tools should be similar. I've narrowed it down to something in the way the alignments are culled when performing the pileup on the data you sent me previously; however, I have not been able to reproduce this difference with any internal datasets.

What threshold does modkit sample-probs or modkit summary give you on your human data? I would expect that the 5mCG model would have an estimated threshold >0.85 (certainly higher than 0.66).
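
If it helps, both can be run directly on the BAM (a sketch of the basic invocations; both subcommands sample reads and report the estimated pass threshold):

modkit sample-probs PAI55644_EPI9_Sotos_dorado_sorted.bam
modkit summary PAI55644_EPI9_Sotos_dorado_sorted.bam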

@qiangfu-eu

@ArtRand I finally made some progress on the possible culprit for the difference between modkit and modbam2bed. I never did any duplicate removal on the generated BAM files, as there is no PCR step involved in the nanopore sequencing protocol. The BAM file is also not cleaned by removing unmapped reads, as by default BWA keeps them with FLAG 4 set. However, in my test sample of pUC19 and lambda, due to the extreme coverage (around 10,000x), the duplicates are causing some issues.

I did a test today: first removing all unmapped reads (which does have some influence on the pUC19 sample, as the majority of the reads are from lambda and hence not mapped), then removing duplicates using Picard (just as a test, not implying it is the right tool), and then running both modkit and modbam2bed (sketched below). Now I get almost identical results when the same cutoff is set.
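
For completeness, the cleanup steps look roughly like this (a sketch only; the intermediate file names are placeholders, and Picard is used just as a test, as noted):

samtools view -b -F 4 -o puc19.mapped.bam PAK67493_barcode19_5mC_sorted.bam   # drop unmapped reads (FLAG 4)
samtools index puc19.mapped.bam
picard MarkDuplicates I=puc19.mapped.bam O=puc19.dedup.bam M=puc19.dup_metrics.txt REMOVE_DUPLICATES=true
samtools index puc19.dedup.bam
# then run modkit pileup and modbam2bed on puc19.dedup.bam with the same cutoff as before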

run_modkitbed_count_5mC_stats.sh modkit.CpG.pileup.acc.090.bed
7331 (0.068532) CpG canonical
99641 (0.931468) 5mCpG modified
0 (0) 5hmCpG modified
106972 total valid
59057 filtered, 632 diff nt, 924 del, 1553 nocall.

run_modbam2bed_count_5mC_stats.sh modbam2bed.CpG.pileup.090.cpg.acc.bed
6614 (0.0622465) CpG canonical
99641 (0.937754) 5mCpG modified
0 (0) 5hmCpG modified
106255 valid
59057 filtered, NA diff nt, NA del, 2270 nocall, 169138 total.

The real difference is in the number of no-calls, which can explain the observed difference in the number of valid calls and canonical calls, while the filtered and 5mC counts are now the same. Could you kindly check whether the cutoff for filtering canonical calls is set differently between the tools?

Clearly, modkit and modbam2bed also handle duplicated reads differently internally. Although in general, in human samples, the duplication level should be very low, which explains why you never observe much difference on your human datasets.

For our human dataset, I will have a look when I have more time, although I am not sure about the duplication level there; for a normal nanopore protocol there should be almost no duplicates?
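
As a quick check of the duplication level in the human BAM, something like this should be enough (a sketch; the duplicate-marked file name is a placeholder):

picard MarkDuplicates I=PAI55644_EPI9_Sotos_dorado_sorted.bam O=PAI55644_EPI9.dupmarked.bam M=PAI55644_EPI9.dup_metrics.txt
samtools flagstat PAI55644_EPI9.dupmarked.bam                    # the 'duplicates' line shows the marked count
samtools view -c -f 0x400 -F 0x900 PAI55644_EPI9.dupmarked.bam   # primary alignments flagged as duplicates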

ArtRand commented Jun 19, 2023

@qiangfu-eu yes, I can check on the differences in no_call counts. Thanks for digging into this!

Would you be able to share the BAM you're using for this test? (before+after marking dupes).

@qiangfu-eu

Sure. Here are the links

I also noticed that there are still secondary alignments in the deduplicated BAM. Could those cause problems?

ArtRand commented Jun 20, 2023

Secondary, duplicate, and unmapped reads shouldn't have any effect. Thanks for the BAMs; I'll take a look at these and get back to you.

@qiangfu-eu

In our case, we never thought to check for duplicate reads. As we do low-pass sequencing for human samples, duplicates are not really an issue in such a dataset. Hence those reads, although duplicates, are not marked as such or removed, and remain in the BAM as normally aligned reads. This is what then causes the issue.

ArtRand commented Jun 28, 2023

@qiangfu-eu
I think I've gotten to the bottom of this (or at least discovered the source of the difference). modbam2bed uses INT_MAX (ref) for the default max-depth setting passed to pileup. modkit doesn't expose this option, and its default is lower (8000, the default in htslib). The next version of modkit (0.1.11, to be released this week) will expose this option.

@qiangfu-eu

Thanks @ArtRand for checking this. Now the only issue that remains is the cutoff for canonical calls, as this is the only remaining difference between the two tools. I ran a test on my cfDNA samples and see the same pattern: the 5mC and 5hmC calls are identical, but the two tools identify different numbers of canonical calls (see the sketch below for how I compare them per site). I assume that they do not handle the cutoff in the same manner. If you have time, could you look into this?
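
This is roughly how I compare the canonical counts per site between the two outputs (a sketch using the pUC19 files from my earlier test; the column indices follow the awk commands above, and the process substitutions assume bash):

join \
  <(awk '{print $1":"$2, $13}' modkit.CpG.pileup.acc.090.bed | sort -k1,1) \
  <(awk '{print $1":"$2, $12}' modbam2bed.CpG.pileup.090.cpg.acc.bed | sort -k1,1) \
  | awk '$2 != $3'   # sites where the canonical counts disagree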

Many thanks.

ArtRand commented Jul 5, 2023

@qiangfu-eu Sorry, could you show me an example (the result of the awk command) where the 5mC/5hmC calls are the same but the canonical calls are different? Or are you referring to the differences in no_call counts? Happy to hear that some of the results are starting to make sense.

ArtRand commented Jul 7, 2023

@qiangfu-eu release 0.1.11 has the --max-depth parameter (docs).
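
For reference, re-running the pUC19 pileup with the cap effectively removed would look something like this (a sketch; --max-depth is the new parameter, and the value and output file name here are just examples):

modkit pileup -t 3 --cpg --ref neb_puc19.fa --combine-strands \
  --filter-threshold 0.6074219 --max-depth 2147483647 \
  PAK67493_barcode19_5mC_sorted.bam puc19.CpG.pileup.maxdepth.bed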

@qiangfu-eu

@ArtRand Thanks for the swift update.

Using the files I provided to you last time, you can reproduce the differences in canonical calls. In the post just above the one with the files, I demonstrated the differences in canonical calls on those files.

ArtRand commented Jul 24, 2023

@qiangfu-eu I have not forgotten about this issue, I've just been busy with some other things. I should be able to get to a conclusion this week.

@Wensu-Liu

Hi @ArtRand, I am also trying to use modkit to replace modbam2bed and found this issue. May I ask whether there has been any progress on it? It has been half a year now. Thanks.

ArtRand commented Jan 25, 2024

Hello @Wensu-Liu,

The differences between modbam2bed and modkit have to do with the hierarchical threshold algorithm used in modkit. We've seen that this thresholding mechanism produces the best results. At this point we've moved to modkit for pileup internally and in our workflows. If you move to modkit and see differences that cannot be explained, feel free to open an issue with an example.

@Wensu-Liu

Thank you @ArtRand
