Automated filter threshold set to 1 for pileup and questions about thresholding #198

kylepalos · 2024-06-05T21:57:59Z

Hi,

Thanks for the great toolkit.

I am interested in calling m6A and pseudoU on direct RNA-sequencing data.

I used Dorado v0.7.0 as follows:

dorado basecaller \
-v hac,pseU \
--min-qscore 10 \
--emit-moves \
./pod5_pass/ \
--estimate-poly-a --device "cuda:0" > v7_model/pseU.bam

dorado basecaller \
-v hac,m6A \
--min-qscore 10 \
--emit-moves \
./pod5_pass/ \
--estimate-poly-a --device "cuda:1" > v7_model/m6a.bam

Then I converted this to fastq and mapped to minimap2. The MM/ML/MN tags are retained in the BAM file.

In the past (dorado v0.5), I have run modkit loosely as follows:

modkit pileup \
-p .9 \
--motif RRACH 2 \
--ref geomel.fa \
m6a_mapped.bam m6a_pileup.bed

And this would produce a pileup file with thousands of lines that have a percent modified > 0

However, with the updated version of Dorado, every line in the pileup output has a percent modified = 0 and it seems to be because the default filter threshold gets set to 1.

Using filter threshold 1 for A.

Similarly, when trying to map all the pseudoU sites using:

modkit pileup \
-p .9 \
--motif T 0 \
--ref genome.fa \
pseU_mapped.bam pseU_pileup.bed

I get:

Using filter threshold 1 for T.

Is a -p value of .9 not sensible? My understanding is that it only keeps the 10% highest confidence Nmod counts in the pileup output. Is that correct? I am using such a high value because we have in-vitro transcribed libraries that seems to have a lot of called "modifications" which begin to decrease at higher -p values.

Similarly, based on my understanding of the two-way base modification calls manual page could I get the desired effect of top 10% highest confidence m6A calls by setting:

# m6A
--filter-threshold A:0.9 --mod-threshold m:0.9

?

Sorry for the long and basic questions, please let me know if you need any additional information.

The text was updated successfully, but these errors were encountered:

ArtRand · 2024-06-05T22:52:01Z

Hello @kylepalos,

What's likely happening is that when you switched to the "all-context" model in dorado 0.7.0 bases that have <5% probability of being modified are clamped to 0% probability of modification. These calls then become the 10% most confident ones (the modbase model will probably never predict 100% chance of modification). A couple things you could do:

Decrease the percentile. You can check what the probability distributions look like with modkit sample-probs ${modbam} --hist ${histogram_dir} --percentiles 0.1,0.8,0.85,0.9, then you can decide on a per-mod and per-base threshold to use.
Pass --modified-bases-threshold 0.0 when doing basecalling+modcalling so that none of the modification probabilities are missing. With the model update, however, it may be the case that the canonical calls are more confident than the modified ones so you'll have to do (1).

Let me know if this helps.

kylepalos · 2024-06-06T13:51:53Z

Hi @ArtRand

Thank you for your help with this.

I performed step 1 as you suggested and my thresholds for my mRNA m6A BAM file were:

The thresholds are nearly identical for this (mRNA) sample and my IVT sample.

I attached 2 probability files for mRNA and IVT where they vary quite dramatically.
mRNA_probabilities.txt
ivt_probabilities.txt

Based on the differences in probabilities but the similarity in thresholds, would the following values presumably retain reasonably confident modifications?

--filter-threshold A:0.8
--mod-thresholds a:0.99

ArtRand · 2024-06-06T14:09:55Z

Hello @kylepalos,

The probability distributions in the IVT samples are going to look "strange" compared to native mRNA because there aren't any m6A residues, so all of the calls should (and appear to be) low confidence. These distributions look like what I would expect. These settings:

--filter-threshold A:0.8
--mod-thresholds a:0.99

Look reasonable to me, you may consider --mod-thresholds a:0.98 to capture some more modification calls.

kylepalos · 2024-06-06T15:39:40Z

Excellent, thanks again for your clarification.

ArtRand added question Looking for clarification on inputs and/or outputs troubleshooting workflow and data preparation questions labels Jun 5, 2024

kylepalos closed this as completed Jun 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automated filter threshold set to 1 for pileup and questions about thresholding #198

Automated filter threshold set to 1 for pileup and questions about thresholding #198

kylepalos commented Jun 5, 2024

ArtRand commented Jun 5, 2024

kylepalos commented Jun 6, 2024

ArtRand commented Jun 6, 2024

kylepalos commented Jun 6, 2024

Automated filter threshold set to 1 for pileup and questions about thresholding #198

Automated filter threshold set to 1 for pileup and questions about thresholding #198

Comments

kylepalos commented Jun 5, 2024

ArtRand commented Jun 5, 2024

kylepalos commented Jun 6, 2024

ArtRand commented Jun 6, 2024

kylepalos commented Jun 6, 2024