Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Obtain Global methylation level of the genome #173

Open
OceaneMion opened this issue Apr 25, 2024 · 9 comments
Open

Obtain Global methylation level of the genome #173

OceaneMion opened this issue Apr 25, 2024 · 9 comments
Labels
question Further information is requested

Comments

@OceaneMion
Copy link

Hi all,
I would like to know how can we obtain the global methylation % of a genome using modkit pileup.
I know that with the fraction modified I Can derive for example the % of methylated cpg, but I would like a more global view taking also in consideration the other nucleotides for the calculation

I know the calculation will look something like :
(Number of methylated CpG / Genome size )*100

Also I am applying no filtering to the modkit pileup as I want all the possible probability per site, so I don't know if a global methylation level % is possible?

Thanks in avance for your help.

@ArtRand
Copy link
Contributor

ArtRand commented Apr 26, 2024

Hello @OceaneMion,

It depends on your biological question. Of course you can calculate the ratio however you want, a couple ways I can think of:

  1. Number of modified CpG calls divided by total number of CpGs (optionally ignoring 5hmC):

This command: modkit pileup $modbam $pileup_bed --combine-strands --cpg --only-tabs [--ignore h]

will produce a pileup BED file, then aggregate with awk:

awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} \
  END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' ${pileup_bed}

N.b. I added --only-tabs so the polars parsing code below will work.

  1. Alternatively, if you want to know the number of methylated CpG residues as a ratio of the total genome size you can no longer use the total number of calls (i.e. reads) that report a site as methylated. To do this, you'll need a slightly more complicated script such as this python code:
import polars as pl
X_pileup = pl.read_csv(
        pileup_fp,
        separator="\t",
        has_header=False,
        dtypes={
            "chrom": str,
            "start": int,
            "end": int,
            "name": str,
            "score": int,
            "strand": str,
            "tstart": int,
            "tend": int,
            "color": str,
            "coverage": int,
            "freq": float,
            "mod": int,
            "canon": int,
            "other": int,
            "del": int,
            "fail": int,
            "diff": int,
            "nocall": int,
        },
    )

# load a dictionary of chrom names to the chrom size
genome_sizes = ...

# You'll have to decide on threshold level over which you consider a site as "methylated".
threshold = 0.8 

for chrom, X_group in X_pileup.group_by("chrom"):
    num_modified = X_group.filter(pl.col("freq") > threshold).shape[0]
    genome_size = genome_sizes[chrom]
    results.append({
        "chrom": chrom, "frac_modified": num_modified / genome_size
    })

X_results = pl.DataFrame(results)

If I were looking for changes in methylation levels, I'd use method (1) since that calculation really gets at what you want to know: how many reads are calling 5mC vs canonical throughout the genome. Method (2) basically does the same thing, except now you need to decide when a site is "methylated" (I've written it as when a site is >80% modified). There are more clever ways to do this as well such as something like the model used in DMR.

Hope this helps.

@ArtRand ArtRand added the question Further information is requested label Apr 26, 2024
@mvolar
Copy link

mvolar commented May 21, 2024

There are more clever ways to do this as well such as something like the model used in DMR.

So, would it be reasonable to say that we can also fit a beta distribution for each position in our genome and then find the "most modified bases" using a threshold i.e. 80% of the bases on that location are modified? Using the approach specified above on moderate to low coverages tends to yield to many results.

I have included an example python code below. The though process is to repeat this across my genome (the only problem is my prior will always be non-informative alpha=1,beta=1) and then set a fixed threshold (50%) to find the most likely methylated positions in my sample and than start from there.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Observed data
f = 0.2  # fraction of modified bases
n = 50   # total number of bases
k = f * n  # number of modified bases

# Prior parameters (non-informative prior)
alpha_prior = 1
beta_prior = 1

# Posterior parameters
alpha_post = alpha_prior + k
beta_post = beta_prior + (n - k)

# Define the Beta distribution
x = np.linspace(0, 1, 100)
y_prior = beta.pdf(x, alpha_prior, beta_prior)
y_post = beta.pdf(x, alpha_post, beta_post)

# Plot the prior and posterior distributions
plt.plot(x, y_prior, color='blue', linestyle='dashed', label='Prior')
plt.plot(x, y_post, color='red', label='Posterior')
plt.title('Prior and Posterior Beta Distributions')
plt.xlabel('Proportion of Modified Bases')
plt.ylabel('Density')
plt.legend()
plt.show()

# Calculate the probability that the true proportion is greater than a threshold
threshold = 0.5
probability = 1 - beta.cdf(threshold, alpha_post, beta_post)
print(f"Probability that the true proportion is greater than {threshold}: {probability}")

@ArtRand
Copy link
Contributor

ArtRand commented May 22, 2024

Hello @mvolar,

Using a Beta distribution to model the probability of modification at each position is a reasonable way to do it, in fact, this is how the MAP-based P-value works, I've also suggested something similar in the past. The only problem with the Beta is that it will not allow you to use multiple modifications (5hmC and 5mC), so you only get a posterior probability of modification as you've shown.

Maybe you could tell me what you're trying to do and I could help you more.

@mvolar
Copy link

mvolar commented May 22, 2024

Hi,

basically I have cell line mouse genome with both knockout and pseudowildtype/control sequenced at a relatively low coverage (15-20ish). I have already performed DMR detection, however this is only for detection of differentially methylated regions.

I would now like to find all "significantly methylated" CpG-s in both my datasets to see how genome wide methylation patterns behave (i.e. genes, introns, intragenic regions, repetitive regions etc. etc.), purely for explorative purposes. To do this I need a way to see which of my 21~ mil rows in the table are indeed methylated since the cells are highly synchronous (i.e. the probability for uniform distribution (either 0 or 1) on a single CpG is the highest)

@ArtRand
Copy link
Contributor

ArtRand commented May 22, 2024

Hello @mvolar,

Since you have two conditions, one thing you might want to try is using the segmentation function in modkit dmr pair. From there you can get regions of similar methylation levels as well as different. The segments that are "different" should have methylation levels that are different between the two conditions, presumably one of the two conditions will have high methylation levels while the other has low levels. The segments labeled as "same" may also be interesting, you can filter down to segments where both conditions have high methylation levels by parsing column 11 or 12. Let me know what you think.

@mvolar
Copy link

mvolar commented May 23, 2024

Just tried that, the problem is because of the low coverage the segmentation data is very bad at recognizing the differentially methylated segments i.e. I get whole chromosomes labeled as "same" , and the output is rather small/noninformative (see attachment, this is event with fine-grained flag)
segments.zip

@OceaneMion
Copy link
Author

OceaneMion commented May 24, 2024

modkit pileup $modbam $pileup_bed --combine-strands --cpg --only-tabs [--ignore h]

Hi I do not want to apply any threshold the goal is to have all the potential methylated sites, so as long as the threshold is > 0, I want to consider the site as methylated
I also want to have the number of CpG instead of the genome size how to obtain that ? Not at the level of reads but on the assembly directly ?

So to be more precise I would like to have the number of CpG, number of 5hmC and number of 5mC at the level of my assembly and not at the level of reads

@ArtRand
Copy link
Contributor

ArtRand commented May 28, 2024

Hello @mvolar,

One thing you could try is increasing the --significance-factor in modkit dmr pair --segment. Increasing this number will have the effect that smaller differences (or less evidence) will be required for the model to transition to the "different" state. Perhaps obviously, the trade-off is an increase in false positives.

@ArtRand
Copy link
Contributor

ArtRand commented May 28, 2024

@OceaneMion I believe I've answered this question in #192?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants