-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Obtain Global methylation level of the genome #173
Comments
Hello @OceaneMion, It depends on your biological question. Of course you can calculate the ratio however you want, a couple ways I can think of:
This command: will produce a pileup BED file, then aggregate with awk '$4=="m" {can+=$13; mod+=$12; oth+=$14; valid+=$10} \
END{print (can/valid) " CpG canonical\n" (mod/valid) " 5mCpG modified\n" (oth/valid) " 5hmCpG modified"}' ${pileup_bed} N.b. I added
import polars as pl
X_pileup = pl.read_csv(
pileup_fp,
separator="\t",
has_header=False,
dtypes={
"chrom": str,
"start": int,
"end": int,
"name": str,
"score": int,
"strand": str,
"tstart": int,
"tend": int,
"color": str,
"coverage": int,
"freq": float,
"mod": int,
"canon": int,
"other": int,
"del": int,
"fail": int,
"diff": int,
"nocall": int,
},
)
# load a dictionary of chrom names to the chrom size
genome_sizes = ...
# You'll have to decide on threshold level over which you consider a site as "methylated".
threshold = 0.8
for chrom, X_group in X_pileup.group_by("chrom"):
num_modified = X_group.filter(pl.col("freq") > threshold).shape[0]
genome_size = genome_sizes[chrom]
results.append({
"chrom": chrom, "frac_modified": num_modified / genome_size
})
X_results = pl.DataFrame(results) If I were looking for changes in methylation levels, I'd use method (1) since that calculation really gets at what you want to know: how many reads are calling 5mC vs canonical throughout the genome. Method (2) basically does the same thing, except now you need to decide when a site is "methylated" (I've written it as when a site is >80% modified). There are more clever ways to do this as well such as something like the model used in DMR. Hope this helps. |
So, would it be reasonable to say that we can also fit a beta distribution for each position in our genome and then find the "most modified bases" using a threshold i.e. 80% of the bases on that location are modified? Using the approach specified above on moderate to low coverages tends to yield to many results. I have included an example python code below. The though process is to repeat this across my genome (the only problem is my prior will always be non-informative alpha=1,beta=1) and then set a fixed threshold (50%) to find the most likely methylated positions in my sample and than start from there. import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
# Observed data
f = 0.2 # fraction of modified bases
n = 50 # total number of bases
k = f * n # number of modified bases
# Prior parameters (non-informative prior)
alpha_prior = 1
beta_prior = 1
# Posterior parameters
alpha_post = alpha_prior + k
beta_post = beta_prior + (n - k)
# Define the Beta distribution
x = np.linspace(0, 1, 100)
y_prior = beta.pdf(x, alpha_prior, beta_prior)
y_post = beta.pdf(x, alpha_post, beta_post)
# Plot the prior and posterior distributions
plt.plot(x, y_prior, color='blue', linestyle='dashed', label='Prior')
plt.plot(x, y_post, color='red', label='Posterior')
plt.title('Prior and Posterior Beta Distributions')
plt.xlabel('Proportion of Modified Bases')
plt.ylabel('Density')
plt.legend()
plt.show()
# Calculate the probability that the true proportion is greater than a threshold
threshold = 0.5
probability = 1 - beta.cdf(threshold, alpha_post, beta_post)
print(f"Probability that the true proportion is greater than {threshold}: {probability}") |
Hello @mvolar, Using a Beta distribution to model the probability of modification at each position is a reasonable way to do it, in fact, this is how the MAP-based P-value works, I've also suggested something similar in the past. The only problem with the Beta is that it will not allow you to use multiple modifications (5hmC and 5mC), so you only get a posterior probability of modification as you've shown. Maybe you could tell me what you're trying to do and I could help you more. |
Hi, basically I have cell line mouse genome with both knockout and pseudowildtype/control sequenced at a relatively low coverage (15-20ish). I have already performed DMR detection, however this is only for detection of differentially methylated regions. I would now like to find all "significantly methylated" CpG-s in both my datasets to see how genome wide methylation patterns behave (i.e. genes, introns, intragenic regions, repetitive regions etc. etc.), purely for explorative purposes. To do this I need a way to see which of my 21~ mil rows in the table are indeed methylated since the cells are highly synchronous (i.e. the probability for uniform distribution (either 0 or 1) on a single CpG is the highest) |
Hello @mvolar, Since you have two conditions, one thing you might want to try is using the segmentation function in |
Just tried that, the problem is because of the low coverage the segmentation data is very bad at recognizing the differentially methylated segments i.e. I get whole chromosomes labeled as "same" , and the output is rather small/noninformative (see attachment, this is event with fine-grained flag) |
Hi I do not want to apply any threshold the goal is to have all the potential methylated sites, so as long as the threshold is > 0, I want to consider the site as methylated So to be more precise I would like to have the number of CpG, number of 5hmC and number of 5mC at the level of my assembly and not at the level of reads |
Hello @mvolar, One thing you could try is increasing the |
@OceaneMion I believe I've answered this question in #192? |
Hi all,
I would like to know how can we obtain the global methylation % of a genome using modkit pileup.
I know that with the fraction modified I Can derive for example the % of methylated cpg, but I would like a more global view taking also in consideration the other nucleotides for the calculation
I know the calculation will look something like :
(Number of methylated CpG / Genome size )*100
Also I am applying no filtering to the modkit pileup as I want all the possible probability per site, so I don't know if a global methylation level % is possible?
Thanks in avance for your help.
The text was updated successfully, but these errors were encountered: