Clarify input format (for Bismark output) #7

Open
bug1303 opened this issue Jan 14, 2019 · 1 comment

bug1303 commented Jan 14, 2019

You write in the README.md that you support the following formats:

Input Type 5 Bismark coverage2cytosine format:

//Bismark coverage2cytosine format Example: chr1 762 763 + 17 64 CG CGA

Column1: chromosome, which is a string.
Column2: nucleotide/start position, an unsigned integer [0,4294967295].
Column3: strand.
Column4: methylated C count, an unsigned integer in [0,4294967295].
Column5: C count, an unsigned integer in [0,4294967295].
Column6: C-context, e.g. CG, CH, CHH.
Column7: C-context, e.g. CGA, CGT, etc.

Input Type 6 Bismark coverage2cytosine format:

Example: chr1 762 763 0.265625 17 76

Column1: chromosome, which is a string.
Column2: nucleotide/start position, an unsigned integer in [0,4294967295].
Column3: nucleotide/end position, an unsigned integer in [0,4294967295].
Column4: methylation percentage, which is calculated by Defiant.
Column5: methylated C count, an unsigned integer in [0,4294967295].
Column6: C count, an unsigned integer in [0,4294967295].

However, these don’t entirely match what is described in the bismark_methylation_extractor help:

The genome-wide cytosine methylation output file is tab-delimited in the following format:
<chromosome> <position> <strand> <count methylated> <count non-methylated> <C-context> <trinucleotide context>

and

The coverage output looks like this (tab-delimited, 1-based genomic coords; zero-based half-open coordinates available with --zero_based):
<chromosome> <start position> <end position> <methylation percentage> <count methylated> <count non-methylated>
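For reference, the two layouts quoted from the Bismark help text can be sketched as small parsers. This is only an illustration of the column order given above, not Defiant's or Bismark's own code; the class and function names are hypothetical, and the cytosine-report sample line is an invented example adapted from the README row.

```python
from typing import NamedTuple


class CytosineReportRow(NamedTuple):
    """One line of the genome-wide cytosine methylation output (7 columns)."""
    chromosome: str
    position: int
    strand: str
    count_methylated: int
    count_unmethylated: int
    c_context: str
    trinucleotide_context: str


class CoverageRow(NamedTuple):
    """One line of the coverage output (6 columns)."""
    chromosome: str
    start: int
    end: int
    methylation_percentage: float
    count_methylated: int
    count_unmethylated: int


def _split_fields(line: str, expected: int) -> list:
    # Both formats are documented as tab-delimited.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != expected:
        raise ValueError(f"expected {expected} tab-separated fields, got {len(fields)}")
    return fields


def parse_cytosine_report_line(line: str) -> CytosineReportRow:
    chrom, pos, strand, meth, unmeth, context, tri = _split_fields(line, 7)
    return CytosineReportRow(chrom, int(pos), strand, int(meth), int(unmeth), context, tri)


def parse_coverage_line(line: str) -> CoverageRow:
    chrom, start, end, pct, meth, unmeth = _split_fields(line, 6)
    return CoverageRow(chrom, int(start), int(end), float(pct), int(meth), int(unmeth))


# Invented 7-column line (README example with the extra end column dropped).
report_row = parse_cytosine_report_line("chr1\t763\t+\t17\t64\tCG\tCGA")
# Line taken from a real coverage file quoted later in this issue.
coverage_row = parse_coverage_line("chr3\t5620584\t5620584\t75\t3\t1")
print(report_row)
print(coverage_row)
```

Under this reading, the 8-column README example for Input Type 5 (`chr1 762 763 + 17 64 CG CGA`) would be rejected by `parse_cytosine_report_line`, because it carries one field more than the help text describes.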

  1. You call both "coverage2cytosine" format. The "coverage2cytosine" Bismark module can create a "genome-wide cytosine methylation output file" (which looks ALMOST like Input Type 5) from the coverage output (which looks ALMOST like your Input Type 6), but that file can also be created by bismark_methylation_extractor directly.

  2. In the Input Type 5 example you show start and end positions (8 columns in total), but the description below it lists only a start position and 7 columns in total. I assume it's just a typo in the example?

  3. You write that the start/end positions are all in [0,4294967295]. Bismark uses 1-based coordinates by default; only when --zero_based is explicitly specified do they become zero-based and half-open. So by default everything is 1-based and start position == end position, yet your example says '762 763'. Should --zero_based indeed be specified?

  4. Bismark clearly states "count methylated" and "count non-methylated" rather than "methylated C count" and "C count". "C count" sounds like total count (methylated + non-methylated). What is actually expected here?

  5. Input Type 6, "Column4: methylation percentage, which is calculated by Defiant": why is this calculated by Defiant, and how? Shouldn't it be an input to Defiant? It is part of the Bismark coverage output. However, your example is "chr1 762 763 0.265625 17 76".
    How would you get to 0.265625? It's neither 17/76, nor 17/(76+17), depending on what you actually mean in no 4... (17/64=0.265625 , assuming the 64 that you mention in input type 5 example )
    However, from a Bismark run, I got e.g. in the coverage output (test.deduplicated.bismark.cov.gz):
    chr3 3008646 3008646 33.3333333333333 1 2
    chr3 5620584 5620584 75 3 1
    So, the methylation percentage is (100*col5/(col5+col6)) and not (col5/col6)
    (Also, the start and end positions are the same (as stated in 3) unless --zero_based is used, but then the file would not be valid input to the coverage2cytosine script.)
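The arithmetic in points 3 and 5 can be checked with a short sketch. It assumes the percentage is 100 * methylated / (methylated + unmethylated), as the two coverage rows above suggest, and that zero-based half-open coordinates for a single base are (position - 1, position); the function names are hypothetical and nothing here is taken from Defiant's source.

```python
import math


def methylation_percentage(count_methylated: int, count_unmethylated: int) -> float:
    """Percentage as it appears in column 4 of the Bismark coverage rows above."""
    total = count_methylated + count_unmethylated
    if total == 0:
        raise ValueError("no reads cover this position")
    return 100.0 * count_methylated / total


def to_zero_based_half_open(position_1_based: int) -> tuple:
    """Map a 1-based single-base position to a zero-based half-open (start, end)."""
    return position_1_based - 1, position_1_based


# The two rows from test.deduplicated.bismark.cov.gz: (reported %, methylated, unmethylated).
observed_rows = [(33.3333333333333, 1, 2), (75.0, 3, 1)]
for reported, methylated, unmethylated in observed_rows:
    computed = methylation_percentage(methylated, unmethylated)
    print(reported, computed, math.isclose(reported, computed, rel_tol=1e-9))

# The README value 0.265625 only appears as a plain ratio of 17 and 64.
print(17 / 64, 17 / 76, 17 / (76 + 17))

# A 1-based position of 763 would give the README's '762 763' pair.
print(to_zero_based_half_open(763))
```

If these assumptions hold, the README example for Input Type 6 mixes a ratio taken from the Input Type 5 counts with a different count column, and its '762 763' coordinates only make sense for --zero_based output.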

Please consider providing an example call for bismark_methylation_extractor that produces files of the type that Defiant will read and process as expected.

Looking forward to testing the program once this is clarified.

hhg7 (Owner) commented Jan 15, 2019

Hi @bug1303

  2. Yes, that is a typo, thanks. I've fixed it, and the correction will appear in the next version.

  3. This makes no difference from Defiant's perspective.

  4. "C count" means "count unmethylated". Thanks for helping to clear up any possible confusion; I've altered this in the help menu.

  5. "which is calculated by Defiant": that's too much detail, and I shouldn't have put it in there. Defiant stores counts as unsigned integers, which I found was the best way to do the programming, instead of using double or float. The percent calculated will be the same.

How would you get to 0.265625? It's neither 17/76, nor 17/(76+17), depending on what you actually mean in no 4... (17/64=0.265625 , assuming the 64 that you mention in input type 5 example )

They were only examples; I didn't mean for them to be taken literally. However, thank you for seeing this. I've changed it.

I've attached 2 files showing what the different inputs look like:

file_type5.txt
file_type6.txt
