-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API for modkit extract read_calls_path output? #194
Comments
Hello @billytcl, I completely understand. Give me a few and I'll make you a build that can stream the read-calls to stdout similarly to how the "raw extract" can currently. I'm also experimenting with different output formats in an effort to speed up |
Thanks! It would be useful if I can get them already thresholded!
…On Mon, Jun 3, 2024 at 8:41 AM Arthur Rand ***@***.***> wrote:
Hello @billytcl <https://github.com/billytcl>,
I completely understand. Give me a few and I'll make you a build that can
stream the read-calls to stdout similarly to how the "raw extract" can
currently. I'm also experimenting with different output formats in an
effort to speed up modkit extract. Depending on what you're doing, maybe
I can add it as a filter on the output.
—
Reply to this email directly, view it on GitHub
<#194 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACPHYT5EPKZ35ERCGJUMNB3ZFSFC7AVCNFSM6AAAAABITOUZDKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBVGU2DENRRGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@billytcl by "already thresholded" you mean just emit the calls where the |
Yes that’s right!
…On Mon, Jun 3, 2024 at 9:56 AM Arthur Rand ***@***.***> wrote:
@billytcl <https://github.com/billytcl> by "already thresholded" you mean
just emit the calls where the fail coluimn is false?
—
Reply to this email directly, view it on GitHub
<#194 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACPHYTYCMCNWOXMOFSSBAQTZFSN25AVCNFSM6AAAAABITOUZDKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBVGY4TQNRXGQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Hello @billytcl, Please find attached a build that allows the following command that will stream the "read calls" table to stdout: $ modkit extract ${modbam} null --read-calls stdout [--pass] [--no-header] Perhaps self-explanatory but These changes will make it into the release later this week as well. |
Thank you! Do you have any ideas on how to effectively parse it
programmatically as an iterator in Python? Perhaps I can read line by line
on stdin and keep track of the read name? Or do you have a better idea?
Billy
…On Tue, Jun 4, 2024 at 3:19 PM Arthur Rand ***@***.***> wrote:
Hello @billytcl <https://github.com/billytcl>,
Please find attached a build that allows the following command that will
stream the "read calls" table to stdout:
$ modkit extract ${modbam} null --read-calls stdout [--pass] [--no-header]
Perhaps self-explanatory but --pass will restrict the read calls table to
only "passing" (i.e. thresholded) calls and --no-header will omit the
first header line (sometimes this makes parsing easier).
These changes will make it into the release later this week as well.
—
Reply to this email directly, view it on GitHub
<#194 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACPHYT5664KYYP2I6I3VKJLZFY4O3AVCNFSM6AAAAABITOUZDKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBYGUYDINZVGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
That's basically what I do, aggregate the lines together until the read_id changes. If I'm doing an expensive computation on the batch of lines, I send that batch on a channel to another thread pool to process in parallel, then send the results to an aggregator. Feel free to email me art.rand[at]nanoporetech.com. |
[extract] Allows read calls to be streamed to stdout. See merge request machine-learning/modkit!187
Is there an API to get the output of modkit extract's read_calls_path (eg, mod calls after thresholding) to get per read tables in an iterator format?
The reason why I want to do this is because I'm working with some huge datasets, and parsing on a per-read basis is really prohibitively slow (also uses a crazy amount of disk space). It would be way easier to parse it on the fly in Python by iterating per read.
Considering that it writes on a per read basis anyway, do you have a code example where I can capture the output and do all the parsing live instead of writing it to disk? I'm imaging per read to output a table, then doing something to it, then iterating to the next read.
The text was updated successfully, but these errors were encountered: