Clearer A and B notation #91

Open
Ge0rges opened this issue Nov 28, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@Ge0rges

Ge0rges commented Nov 28, 2023

The output of dmr multi contains columns such as samplea_counts and sampleb_counts. Since the command is given the sample names, it would be nice if the file itself recorded which sample name corresponds to A and which to B. Currently this relies on the user ensuring they don't modify the sample names. Perhaps this is not very practical if the output follows an existing standard, which would be understandable; in that case, perhaps a comment at the top of the file?

PS: What is the difference between the nucleotide count and the output column samplea total, described as "Total number of base modification calls in the region, including unmodified, for sample A"? Is it because some nucleotides don't pass the threshold to be confidently called as unmodified?

@ArtRand ArtRand added this to the 0.2.3 milestone Nov 28, 2023
@ArtRand ArtRand added the enhancement New feature or request label Nov 28, 2023
@ArtRand ArtRand removed this from the 0.2.3 milestone Nov 28, 2023
@ArtRand ArtRand removed the enhancement New feature or request label Nov 28, 2023
@ArtRand
Contributor

ArtRand commented Nov 28, 2023

Hello @Ge0rges,

For human-readable information such as the sample names, the place to look is the log file. The first line will always have the command itself, so any data provenance questions can be answered (hopefully) by inspecting that. Granted, the log file is optional, so maybe there should be some guard rails around that. I've considered adding flags that would add headers to the BED output files (pileup and dmr), just to make parsing with tabular tools easier. So I'll think about adding the sample name as you suggest. On the other hand, adding the sample name changes the schema.

To your second question, I'm going to assume that by nucleotide count you mean the number of bases with potential modifications (e.g. all Cs in a region). sample_a_total is the total number of calls in the region as reported by the bedMethyl. As you know, some calls are filtered out before making this table. Further, every base has some amount of coverage >= 1. So the maximum this number could be, assuming no filtering, is the nucleotide count (# C bases) multiplied by the coverage. Hope this helps,

A
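
The upper bound described above can be illustrated with a small worked example. This is only a sketch: the function name `max_total_calls`, the per-base coverage values, and the number of filtered calls are invented for illustration and do not come from modkit. It also generalizes "nucleotide count multiplied by the coverage" to a sum over per-base coverage, which reduces to the product when coverage is uniform.

```python
from typing import Sequence


def max_total_calls(per_base_coverage: Sequence[int]) -> int:
    """Upper bound on a region's total call count, assuming no filtering.

    Each entry is the read coverage at one candidate base (e.g. one C).
    """
    return sum(per_base_coverage)


# Hypothetical region: 10 C bases, each covered by 20 reads.
coverage = [20] * 10
upper_bound = max_total_calls(coverage)

# Suppose filtering dropped 15 calls (an invented figure).
observed_total = upper_bound - 15

# The reported total can be lower than the bound, never higher.
assert observed_total <= upper_bound
print(f"bases={len(coverage)} bound={upper_bound} observed={observed_total}")
```

With uniform coverage, the bound equals the number of candidate bases times the coverage, so a reported total below that product is expected whenever any calls are filtered.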

@Ge0rges
Author

Ge0rges commented Nov 28, 2023

Thank you for the prompt reply as usual. I would argue that the sample naming data should be within the file to make it more independent and less prone to errors when sharing, or coming back to the data after a while. I think log files are often thought of as temporary.
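
As a rough illustration of the "comment at the top of the file" idea raised in this thread, the sketch below writes the A/B pairing as `#`-prefixed lines ahead of tab-separated rows and reads it back. Everything here is hypothetical: the `#sample_a=` and `#sample_b=` keys, the function names, and the example row are assumptions, not modkit's actual output format or API.

```python
import csv
import io


def write_with_sample_labels(handle, rows, sample_a, sample_b):
    """Write comment lines recording the A/B pairing, then tab-separated rows."""
    handle.write(f"#sample_a={sample_a}\n")
    handle.write(f"#sample_b={sample_b}\n")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


def read_with_sample_labels(handle):
    """Return (labels, rows); '#key=value' lines populate the labels dict."""
    labels = {}
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            key, _, value = line[1:].partition("=")
            labels[key] = value
        elif line:
            rows.append(line.split("\t"))
    return labels, rows


# Round trip through an in-memory file with one invented row.
buffer = io.StringIO()
write_with_sample_labels(buffer, [["chr1", 0, 100, 12, 40]], "control", "treated")
buffer.seek(0)
labels, rows = read_with_sample_labels(buffer)
print(labels, rows)
```

Because the pairing lives in comment lines rather than in a new column, the tabular rows keep the same fields, and tools that skip `#` lines would see the data unchanged. Whether a given downstream parser tolerates such lines is something to check case by case.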

@ArtRand
Contributor

ArtRand commented Nov 28, 2023

Hello @Ge0rges,

I tend to agree. I find myself relying on the "command" comments in VCF files and PG records in BAMs all the time. There is a lot going into the 0.2.3 release, so I may not get this feature in there - but I'll try. Thanks as always for the suggestion.

@ArtRand ArtRand added this to the 0.2.4 milestone Dec 4, 2023
@ArtRand ArtRand modified the milestones: 0.2.4, 0.2.5 Dec 22, 2023
@ArtRand ArtRand removed this from the 0.2.5 milestone Oct 10, 2024
@ArtRand ArtRand added the enhancement New feature or request label Oct 10, 2024

2 participants