
Probability threshold #80

Open · NickNCL opened this issue Nov 8, 2023 · 5 comments
Labels: question (Looking for clarification on inputs and/or outputs)

NickNCL commented Nov 8, 2023

Hi,

I wanted to ask about the rationale for the empirically determined filtering threshold used by default. Surely it would make more sense for this to be fixed, e.g. between different replicates/conditions? Would data be comparable between replicates if the filtering threshold varies?

I'm getting warnings that the thresholds selected by default are low:

Threshold of 0.5722656 for base A is very low. Consider increasing the filter-percentile or specifying a higher threshold.
Threshold of 0.6074219 for base C is low. Consider increasing the filter-percentile or specifying a higher threshold.

I'm presuming this is because I have generally low probability scores, maybe due to the older nanopore sequencing chemistry I'm using (V10), and possibly also the basecalling model I used (the all_context modification model).

I'm hoping that high sequencing depth might, to a certain extent, compensate for a higher error rate.

Is there any guidance on a generally acceptable threshold?

marcus1487 (Contributor) commented:

> I wanted to ask about the rationale for the empirically determined filtering threshold used by default. Surely it would make more sense for this to be fixed, e.g. between different replicates/conditions? Would data be comparable between replicates if the filtering threshold varies?
For larger studies using the same model across several samples it would certainly be best to select a constant threshold to apply across samples. Data should still be comparable when filtering varies across samples, but this may cause complications in some situations. The default of filtering the lowest 10% of calls simply provides a better default across different modified base models and sample types. This is also the threshold used to report all of our modified base accuracy metrics, so applying it should reproduce those accuracy metrics.
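
If you wanted the estimator to filter a larger fraction of calls than the default, the filter-percentile mentioned in the warning message can be raised. A minimal sketch (the 0.2 value and file names are placeholders, not recommendations):

# filter the lowest 20% of calls instead of the default 10%
modkit pileup ${your_bam} ${out_bed} --filter-percentile 0.2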

For the low threshold on the older model, this makes sense. The best recommendation here would be to upgrade to the latest kits where the accuracy (and confidence in each call) is much higher. For higher accuracy from these older models you may have some success increasing the filtering threshold, but this will obviously drop some data as well.

> Is there any guidance on a generally acceptable threshold?
The guidance for single-sample analysis would be to use the default filtering threshold of 10% of the data. The recommendation for studies with more samples would be to estimate the threshold from all samples (or a uniform sampling over all samples/conditions) and apply that one threshold across the samples.

Some care may need to be taken here to avoid including a sample with drastically lower quality than the rest, which would drive down the threshold. For example, if one sample were completely modified (not an intended target for modified base models), you may observe globally lower modified base confidence values. It might be best to exclude these types of samples when determining the threshold. Essentially, the recommendation for more complex samples is to explore the data and apply thresholds as best fits the desired analysis; a global recommendation in these settings is not really feasible.
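
A hedged sketch of that multi-sample workflow (file names and the 0.80 values are hypothetical placeholders; read the actual thresholds from the sample-probs output):

# estimate per-base thresholds once, from a pooled or representative sample
modkit sample-probs pooled.bam --hist --out-dir pooled_probs
# then apply the same fixed thresholds to every sample
modkit pileup sample1.bam sample1.bed --filter-threshold A:0.80 --filter-threshold C:0.80
modkit pileup sample2.bam sample2.bed --filter-threshold A:0.80 --filter-threshold C:0.80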

NickNCL (Author) commented Nov 9, 2023

Thanks for the advice

Hopefully we can move on to the newer kits soon; we just have a bunch of older flow cells the boss wants me to use up :)

While I understand your statement that it's not feasible to provide a global threshold (always the way with bioinformatics...), the warnings within the software about a low threshold seem to indicate that certain thresholds are not advisable. This makes me think I should manually raise it at least a little.

ArtRand (Contributor) commented Nov 10, 2023

@NickNCL One thing you could do is run

mkdir -p ${out_dir}
modkit sample-probs ${your_bam} --hist --out-dir ${out_dir}

Then inspect probabilities.txt and probabilities.tsv and set the threshold manually with

# in pileup
--filter-threshold A:{canonical_A_threshold}
--mod-threshold a:{6mA_threshold}
--filter-threshold C:{canonical_C_threshold}
--mod-threshold m:{5mC_threshold}
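
Putting those flags together, a full invocation might look like the sketch below (the 0.80/0.85 values and output path are hypothetical placeholders, to be replaced with values read from the sample-probs output):

modkit pileup ${your_bam} ${out_dir}/pileup.bed \
    --filter-threshold A:0.80 --mod-threshold a:0.85 \
    --filter-threshold C:0.80 --mod-threshold m:0.85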

ArtRand added the question label Nov 12, 2023
Ge0rges commented Jun 27, 2024

@ArtRand would it make sense to concatenate one's samples into a single BAM and run sample-probs, assuming there are no outlying samples?

ArtRand (Contributor) commented Jun 27, 2024

Hello @Ge0rges,

You can; there is a one-liner below. Do you want to know the distribution of probabilities when the samples are combined? Using a combination of sample-probs and extract is useful for exploring the base modification probabilities.

If you're trying to estimate a pass threshold, you shouldn't combine the samples unless you're planning on operating on them all together. The estimator in modkit will attempt to find the pass threshold such that the 10% lowest confidence calls are filtered out. Typically you do this on a per-sample basis and produce artifacts from each sample. If you combine the samples, you can still filter out the 10% lowest confidence calls, but it is possible that you won't remove calls from each of the constituent samples evenly. Maybe this is fine; it depends on your use case.

# the sort by read name is a way to pseudo randomize them
samtools merge -o - ${bam1} ${bam2} | samtools sort -n | ${modkit} sample-probs - --hist -o probs
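
If you'd rather keep the estimates per-sample, here is a hedged sketch (the sample paths are placeholders) that runs sample-probs on each BAM separately so the thresholds can be compared before settling on one:

# one histogram/threshold estimate per input BAM
for bam in sample1.bam sample2.bam; do
    ${modkit} sample-probs ${bam} --hist -o probs_$(basename ${bam} .bam)
done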
