Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validation with ground truth #177

Open
vyx-lucy-kaplun opened this issue May 1, 2024 · 2 comments
Open

Validation with ground truth #177

vyx-lucy-kaplun opened this issue May 1, 2024 · 2 comments

Comments

@vyx-lucy-kaplun
Copy link

Hi, I have a few questions regarding the use of modkit validate.

  1. Can you please demonstrate an example of a ground truth bed file required for modkit validate? If I am trying to evaluate 5mC, should I have a truth bed file with "-" (for not methylated) and "C"(for methylated) in the 4th column, or is it something else? And should I somehow account specifically for the CpG context?
1	chrom 
2	start position	
3	end position	
4	mod code	
6	strand	
  1. Is there a way to generate such ground truth bedfile using bedrapg format data as an input? How would you recommend to generate the 4th column of the bed truth for validation from the score present in the bedgraph?
chr1	600684	600686	26.923	7	19
chr1	600714	600716	38.71	12	19
  1. Do we necessarily need the strand information in the truth file?

Thank you!

@marcus1487
Copy link
Contributor

At a high level I would note that this command is intended for the validation of samples where the modified base status at reference positions is known with absolute certainty. It is possible to run this command with any type of data, but the reported metrics represent the combination of the model performance and the validity of the ground truth data. If you have partial methylation at reference positions you may be more interested in comparing different samples or technologies with a heatmap/correlation analysis, but that is not the intended purpose of this command. Answers to your specific questions are below.

Can you please demonstrate an example of a ground truth bed file required for modkit validate? If I am trying to evaluate 5mC, should I have a truth bed file with "-" (for not methylated) and "C"(for methylated) in the 4th column, or is it something else?

The mod code for modified bases is as specified in the BAM tag format. So for common modified bases this will be a single letter code (m = 5mC, h = 5hmC, etc; see BAM tag spec for full list). For other bases the mod code will be the ChEBI code for the modified base (e.g. 21839 = 4mC).

For canonical positions in the reference the - character is used.

Here is an example for two modified positions:

chr1      10      11      m       0       +
chr2      50      51      m       0       +

And for canonical positions.

chr1      10      11      -       0       +
chr2      50      51      -       0       +

And should I somehow account specifically for the CpG context?

The positions in the BED file specify the positions at which you know the modified base identity with 100% certainty for all reads covering the reference position. Any such positions can be added to the BED file. If these are CpG sites then these can be added to the ground truth BED file. You can generate several BED files if you wish to validate accuracy on different subsets of reference positions (e.g. CG, CHG, and CHH contexts).

Is there a way to generate such ground truth bedfile using bedrapg format data as an input? How would you recommend to generate the 4th column of the bed truth for validation from the score present in the bedgraph?

Generally this command is intended for positions where the modified base status at a reference position is known with absolute certainty. It is possible to set thresholds given an alternative technology (or even replicates of the same one), but in this case any differences due to biological variation will be represented as "errors" which may not accurately represent the performance of the modified base calls.

Do we necessarily need the strand information in the truth file?

Yes. Modified bases occur on a single strand not on both. Some sites show constitutive methylation on both strands within certain motifs (e.g. 5mC in CG contexts). In these cases both strands can be added to the ground truth BED file.

@vyx-lucy-kaplun
Copy link
Author

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants