Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated filter threshold set to 1 for pileup and questions about thresholding #198

Closed
kylepalos opened this issue Jun 5, 2024 · 4 comments
Labels
question Looking for clarification on inputs and/or outputs troubleshooting workflow and data preparation questions

Comments

@kylepalos
Copy link

Hi,

Thanks for the great toolkit.

I am interested in calling m6A and pseudoU on direct RNA-sequencing data.

I used Dorado v0.7.0 as follows:

dorado basecaller \
-v hac,pseU \
--min-qscore 10 \
--emit-moves \
./pod5_pass/ \
--estimate-poly-a --device "cuda:0" > v7_model/pseU.bam

dorado basecaller \
-v hac,m6A \
--min-qscore 10 \
--emit-moves \
./pod5_pass/ \
--estimate-poly-a --device "cuda:1" > v7_model/m6a.bam

Then I converted this to fastq and mapped to minimap2. The MM/ML/MN tags are retained in the BAM file.

In the past (dorado v0.5), I have run modkit loosely as follows:

modkit pileup \
-p .9 \
--motif RRACH 2 \
--ref geomel.fa \
m6a_mapped.bam m6a_pileup.bed

And this would produce a pileup file with thousands of lines that have a percent modified > 0

However, with the updated version of Dorado, every line in the pileup output has a percent modified = 0 and it seems to be because the default filter threshold gets set to 1.

Using filter threshold 1 for A.

Similarly, when trying to map all the pseudoU sites using:

modkit pileup \
-p .9 \
--motif T 0 \
--ref genome.fa \
pseU_mapped.bam pseU_pileup.bed

I get:

Using filter threshold 1 for T.

Is a -p value of .9 not sensible? My understanding is that it only keeps the 10% highest confidence Nmod counts in the pileup output. Is that correct? I am using such a high value because we have in-vitro transcribed libraries that seems to have a lot of called "modifications" which begin to decrease at higher -p values.

Similarly, based on my understanding of the two-way base modification calls manual page could I get the desired effect of top 10% highest confidence m6A calls by setting:

# m6A
--filter-threshold A:0.9 --mod-threshold m:0.9

?

Sorry for the long and basic questions, please let me know if you need any additional information.

@ArtRand
Copy link
Contributor

ArtRand commented Jun 5, 2024

Hello @kylepalos,

What's likely happening is that when you switched to the "all-context" model in dorado 0.7.0 bases that have <5% probability of being modified are clamped to 0% probability of modification. These calls then become the 10% most confident ones (the modbase model will probably never predict 100% chance of modification). A couple things you could do:

  1. Decrease the percentile. You can check what the probability distributions look like with modkit sample-probs ${modbam} --hist ${histogram_dir} --percentiles 0.1,0.8,0.85,0.9, then you can decide on a per-mod and per-base threshold to use.
  2. Pass --modified-bases-threshold 0.0 when doing basecalling+modcalling so that none of the modification probabilities are missing. With the model update, however, it may be the case that the canonical calls are more confident than the modified ones so you'll have to do (1).

Let me know if this helps.

@ArtRand ArtRand added question Looking for clarification on inputs and/or outputs troubleshooting workflow and data preparation questions labels Jun 5, 2024
@kylepalos
Copy link
Author

Hi @ArtRand

Thank you for your help with this.

I performed step 1 as you suggested and my thresholds for my mRNA m6A BAM file were:

image

The thresholds are nearly identical for this (mRNA) sample and my IVT sample.

I attached 2 probability files for mRNA and IVT where they vary quite dramatically.
mRNA_probabilities.txt
ivt_probabilities.txt

Based on the differences in probabilities but the similarity in thresholds, would the following values presumably retain reasonably confident modifications?

--filter-threshold A:0.8
--mod-thresholds a:0.99

@ArtRand
Copy link
Contributor

ArtRand commented Jun 6, 2024

Hello @kylepalos,

The probability distributions in the IVT samples are going to look "strange" compared to native mRNA because there aren't any m6A residues, so all of the calls should (and appear to be) low confidence. These distributions look like what I would expect. These settings:

--filter-threshold A:0.8
--mod-thresholds a:0.99

Look reasonable to me, you may consider --mod-thresholds a:0.98 to capture some more modification calls.

@kylepalos
Copy link
Author

Excellent, thanks again for your clarification.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs troubleshooting workflow and data preparation questions
Projects
None yet
Development

No branches or pull requests

2 participants