API for modkit extract read_calls_path output? #194

Closed
billytcl opened this issue May 31, 2024 · 8 comments
Labels
build-available custom build produced for fix. enhancement New feature or request

Comments

@billytcl

Is there an API to get the output of modkit extract's read_calls_path (e.g., mod calls after thresholding) as per-read tables in an iterator format?

The reason I want to do this is that I'm working with some huge datasets, and parsing on a per-read basis is prohibitively slow (it also uses a huge amount of disk space). It would be much easier to parse it on the fly in Python by iterating per read.

Considering that it writes on a per-read basis anyway, do you have a code example where I can capture the output and do all the parsing live instead of writing it to disk? I'm imagining getting a table for each read, doing something with it, then moving on to the next read.

@ArtRand
Contributor

ArtRand commented Jun 3, 2024

Hello @billytcl,

I completely understand. Give me a few and I'll make you a build that can stream the read-calls to stdout similarly to how the "raw extract" can currently. I'm also experimenting with different output formats in an effort to speed up modkit extract. Depending on what you're doing, maybe I can add it as a filter on the output.

@ArtRand ArtRand added the enhancement New feature or request label Jun 3, 2024
@billytcl
Author

billytcl commented Jun 3, 2024 via email

@ArtRand
Contributor

ArtRand commented Jun 3, 2024

@billytcl by "already thresholded", do you mean just emit the calls where the fail column is false?

@billytcl
Author

billytcl commented Jun 3, 2024 via email

@ArtRand
Contributor

ArtRand commented Jun 4, 2024

Hello @billytcl,

Please find attached a build that supports the following command, which streams the "read calls" table to stdout:

$ modkit extract ${modbam} null --read-calls stdout [--pass] [--no-header]

Perhaps self-explanatory but --pass will restrict the read calls table to only "passing" (i.e. thresholded) calls and --no-header will omit the first header line (sometimes this makes parsing easier).

These changes will make it into the release later this week as well.
modkit_dev1393012_centos7_x86_64.tar.gz
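
For readers who want to consume this stream from Python, here is a minimal sketch that is not part of the original thread. It assumes the command form shown above for this build (other versions may differ), tab-separated output with a header row that includes a `read_id` column, and that all rows for one read are written contiguously. The helper names `iter_reads` and `stream_read_calls` are hypothetical.

```python
import csv
import itertools
import subprocess
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, str]


def iter_reads(lines: Iterable[str]) -> Iterator[Tuple[str, List[Row]]]:
    """Yield (read_id, rows) for each run of consecutive rows sharing a read_id.

    Assumption: tab-separated text whose first line is a header containing a
    `read_id` column, with all rows for a given read adjacent to each other.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for read_id, rows in itertools.groupby(reader, key=lambda row: row["read_id"]):
        yield read_id, list(rows)


def stream_read_calls(modbam: str, modkit: str = "modkit") -> Iterator[Tuple[str, List[Row]]]:
    """Run modkit and yield per-read tables without writing them to disk.

    Assumption: the command line matches the one given in this thread.
    """
    cmd = [modkit, "extract", modbam, "null", "--read-calls", "stdout", "--pass"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        # proc.stdout is read lazily, so only one read's rows are held at a time.
        yield from iter_reads(proc.stdout)
    # Leaving the `with` block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

A caller would then write `for read_id, rows in stream_read_calls("sample.bam"): ...`, where each `rows` is a list of dicts keyed by whatever column names the header provides.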

@ArtRand ArtRand added the build-available custom build produced for fix. label Jun 4, 2024
@billytcl
Author

billytcl commented Jun 4, 2024 via email

@ArtRand
Contributor

ArtRand commented Jun 4, 2024

@billytcl

That's basically what I do, aggregate the lines together until the read_id changes. If I'm doing an expensive computation on the batch of lines, I send that batch on a channel to another thread pool to process in parallel, then send the results to an aggregator. Feel free to email me art.rand[at]nanoporetech.com.
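
The pattern described here (batch lines until the read_id changes, hand each batch to a worker pool, then aggregate the results) could look roughly like the following in Python. This is a sketch under stated assumptions, not the maintainer's code: it assumes `--no-header` output with the read ID as the first tab-separated field, and `summarize` is a hypothetical placeholder for the expensive per-read computation. A thread pool is used for simplicity; for CPU-bound work in Python a `ProcessPoolExecutor` may be a better fit.

```python
import itertools
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Deque, Dict, Iterable, Iterator, List, Tuple

Batch = Tuple[str, List[str]]


def batches_by_read(lines: Iterable[str]) -> Iterator[Batch]:
    """Collect consecutive lines until the first tab-separated field changes.

    Assumption: no header line, and the read ID is the first column.
    """
    stripped = (line.rstrip("\n") for line in lines)
    for read_id, group in itertools.groupby(stripped, key=lambda s: s.split("\t", 1)[0]):
        yield read_id, list(group)


def summarize(batch: Batch) -> Tuple[str, int]:
    """Hypothetical stand-in for an expensive computation: count calls per read."""
    read_id, rows = batch
    return read_id, len(rows)


def process_stream(lines: Iterable[str], workers: int = 4, max_in_flight: int = 64) -> Dict[str, int]:
    """Process per-read batches in parallel and aggregate the results."""
    totals: Dict[str, int] = {}
    pending: Deque[Future] = deque()

    def collect_oldest() -> None:
        read_id, n_calls = pending.popleft().result()
        totals[read_id] = totals.get(read_id, 0) + n_calls

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batches_by_read(lines):
            pending.append(pool.submit(summarize, batch))
            # Bound the number of queued batches so a huge stream is not
            # read into memory faster than the workers can drain it.
            if len(pending) >= max_in_flight:
                collect_oldest()
        while pending:
            collect_oldest()
    return totals
```

Capping the number of in-flight batches keeps memory use roughly constant, which matters here because `Executor.map` would otherwise submit every batch up front.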

ArtRand added a commit that referenced this issue Jun 8, 2024
[extract] Allows read calls to be streamed to stdout.

See merge request machine-learning/modkit!187
@ArtRand
Contributor

ArtRand commented Jun 11, 2024

@billytcl this change has been released in v0.3.1rc1

@ArtRand ArtRand closed this as completed Jun 11, 2024