
Different threshold from dorado models #918

Open
Macdot3 opened this issue Jun 19, 2024 · 3 comments


Macdot3 commented Jun 19, 2024

Hi everyone,
I ran some samples with an updated Dorado model, and Modkit reports a different filter threshold for the same sample: the newer model yields a noticeably lower calling threshold than the same sample processed with an older model and version. What could be causing this, what impact might it have on my downstream analysis, and what do you recommend in this case? I've included the outputs below.

with dorado v0.5.3 - model [email protected]
modkit pileup --ref rCRS_16426.fasta --cpg --combine-strands ../PCR_MT10288M_final_filtered.bam ../PCR_MT10288M.bed
> calculated chunk size: 6, interval size 100000, processing 600000 positions concurrently
> filtering to only CpG motifs
> attempting to sample 10042 reads
> Using filter threshold 0.90234375 for C.
> Done, processed 858 rows. Processed ~432 reads and skipped zero reads

with dorado v0.7.1 - model [email protected]
modkit pileup --ref rCRS_16426.fasta --cpg --combine-strands ../PCR_MT10288M_final_filtered_new.bam ../PCR_MT10288M_new.bed
> calculated chunk size: 6, interval size 100000, processing 600000 positions concurrently
> filtering to only CpG motifs
> attempting to sample 10042 reads
> Threshold of 0.50390625 for base C is very low. Consider increasing the filter-percentile or specifying a higher threshold.
> Done, processed 856 rows. Processed ~434 reads and skipped zero reads
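For context on why the reported threshold moves between models: as I understand modkit's defaults, the pass threshold is estimated from the data itself, taken as a low percentile (the filter-percentile, 10% by default) of the sampled calls' top-class probabilities. A model whose canonical calls are less confident therefore yields a lower auto-threshold, even on identical reads. A minimal sketch of that percentile logic, with illustrative values only (this is not modkit's actual implementation):

```python
import random

def estimate_filter_threshold(call_probs, filter_percentile=0.10):
    """Percentile-based pass threshold: the least-confident
    `filter_percentile` fraction of calls falls below it and gets filtered."""
    ranked = sorted(call_probs)
    return ranked[int(filter_percentile * len(ranked))]

rng = random.Random(0)
# A model whose calls are sharply confident...
sharp = [rng.uniform(0.9, 1.0) for _ in range(900)] + \
        [rng.uniform(0.5, 0.9) for _ in range(100)]
# ...versus one whose canonical calls are spread lower.
soft = [rng.uniform(0.5, 1.0) for _ in range(1000)]

print(round(estimate_filter_threshold(sharp), 3))  # higher auto-threshold
print(round(estimate_filter_threshold(soft), 3))   # lower auto-threshold
```

So a drop from ~0.90 to ~0.50 in the reported threshold is consistent with the newer model simply emitting lower canonical-call probabilities, not necessarily with anything being wrong in the pipeline.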

Thank you very much for your help.


ArtRand commented Jun 19, 2024

Hello @Macdot3,

Could you do two things for me to help diagnose this issue?

  1. Tell me the exact dorado basecalling command you used, so I know which basecalling model and which modified-bases model were involved.
  2. Run modkit sample-probs ../PCR_MT10288M_final_filtered.bam --hist ./probability_histograms and send me the contents of that directory.

Thanks


Macdot3 commented Jun 20, 2024

Hi @ArtRand,
Here is the folder: PCR_MT10288_filter.zip. Compared to what I wrote above, the threshold is now around 0.86, because I had forgotten to apply a filter with samtools.
Regarding the Dorado commands, for the file PCR_MT10288M_final_filtered, I have these:

cd ../dorado-0.5.3-linux-x64/bin
./dorado basecaller /Model/[email protected]/ /home/Nanopore/Dorado/POD5/POD5_barcode12_PCR/ --modified-bases 5mCG_5hmCG --device cpu > /home/CALLS/PCR_MT10288.bam

This was followed by alignment with dorado aligner. I subsequently ran the same sample with these new versions:

cd ../dorado-0.7.1-linux-x64/bin
./dorado basecaller /Model/[email protected]/ /home/Nanopore/Dorado/POD5/POD5_barcode12_PCR/ --modified-bases 5mCG_5hmCG --device cpu > /home/CALLS/PCR_MT10288_new_model.bam

@marcus1487

We have been able to reproduce a similar result and are looking into it. The v5 5mC+5hmC model does appear to produce lower-confidence canonical calls than the v4.3 5mC+5hmC model, but overall accuracy is improved with the v5 model. We will dig into this result further and aim to produce a more robust modified-base model in future releases. In the meantime you can use these results with confidence: the accuracy of the v5 model is higher even though the canonical probabilities have dropped a bit.

One point that may help is setting the threshold manually. We will be releasing a ground-truth analysis blog post in the coming months with this information, but for the v5 model we find that a threshold of about 0.76 works best on a set of balanced C/5mC/5hmC calls. This will filter canonical calls a bit more heavily than the previous v4.3 model, but should produce more accurate results overall. Note that this threshold will change for different basecalling and modified-base models; the blog post will outline how the threshold is determined and let users estimate a new one for new models/conditions.
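If you prefer to pin the threshold rather than rely on the auto-estimate, modkit accepts a user-supplied value on the command line (check modkit pileup --help for the exact flag; --filter-threshold is the one I believe applies). The effect of a fixed cutoff like 0.76 is simple to sketch; the probabilities below are illustrative only, not real model output:

```python
def apply_fixed_threshold(call_probs, threshold=0.76):
    """Keep only calls whose top-class probability clears `threshold`;
    return (kept calls, fraction filtered out)."""
    kept = [p for p in call_probs if p >= threshold]
    return kept, 1 - len(kept) / len(call_probs)

# Hypothetical per-call confidences with a tail of low-confidence
# canonical calls, as described for the v5 model.
probs = [0.55, 0.62, 0.70, 0.78, 0.81, 0.88, 0.93, 0.97, 0.99, 0.99]
kept, frac = apply_fixed_threshold(probs)
print(len(kept), round(frac, 2))  # 7 calls pass, 30% filtered
```

With a fixed threshold the filtered fraction floats with the model's probability distribution, which is the intended behavior here: the less-confident canonical calls from v5 are filtered more aggressively while the thresholded calls stay comparable across runs.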

@iiSeymour iiSeymour transferred this issue from nanoporetech/modkit Jul 1, 2024