API for modkit extract read_calls_path output? #194

billytcl · 2024-05-31T21:07:44Z

Is there an API to get the output of modkit extract's read_calls_path (eg, mod calls after thresholding) to get per read tables in an iterator format?

The reason why I want to do this is because I'm working with some huge datasets, and parsing on a per-read basis is really prohibitively slow (also uses a crazy amount of disk space). It would be way easier to parse it on the fly in Python by iterating per read.

Considering that it writes on a per read basis anyway, do you have a code example where I can capture the output and do all the parsing live instead of writing it to disk? I'm imaging per read to output a table, then doing something to it, then iterating to the next read.

ArtRand · 2024-06-03T15:41:13Z

Hello @billytcl,

I completely understand. Give me a few and I'll make you a build that can stream the read-calls to stdout similarly to how the "raw extract" can currently. I'm also experimenting with different output formats in an effort to speed up modkit extract. Depending on what you're doing, maybe I can add it as a filter on the output.

billytcl · 2024-06-03T16:40:38Z

Thanks! It would be useful if I can get them already thresholded!

…

On Mon, Jun 3, 2024 at 8:41 AM Arthur Rand ***@***.***> wrote: Hello @billytcl <https://github.com/billytcl>, I completely understand. Give me a few and I'll make you a build that can stream the read-calls to stdout similarly to how the "raw extract" can currently. I'm also experimenting with different output formats in an effort to speed up modkit extract. Depending on what you're doing, maybe I can add it as a filter on the output. — Reply to this email directly, view it on GitHub <#194 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACPHYT5EPKZ35ERCGJUMNB3ZFSFC7AVCNFSM6AAAAABITOUZDKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBVGU2DENRRGM> . You are receiving this because you were mentioned.Message ID: ***@***.***>

ArtRand · 2024-06-03T16:55:52Z

@billytcl by "already thresholded" you mean just emit the calls where the fail coluimn is false?

billytcl · 2024-06-03T17:10:02Z

Yes that’s right!

…

On Mon, Jun 3, 2024 at 9:56 AM Arthur Rand ***@***.***> wrote: @billytcl <https://github.com/billytcl> by "already thresholded" you mean just emit the calls where the fail coluimn is false? — Reply to this email directly, view it on GitHub <#194 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACPHYTYCMCNWOXMOFSSBAQTZFSN25AVCNFSM6AAAAABITOUZDKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBVGY4TQNRXGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

ArtRand · 2024-06-04T22:19:03Z

Hello @billytcl,

Please find attached a build that allows the following command that will stream the "read calls" table to stdout:

$ modkit extract ${modbam} null --read-calls stdout [--pass] [--no-header]

Perhaps self-explanatory but --pass will restrict the read calls table to only "passing" (i.e. thresholded) calls and --no-header will omit the first header line (sometimes this makes parsing easier).

These changes will make it into the release later this week as well.
modkit_dev1393012_centos7_x86_64.tar.gz

billytcl · 2024-06-04T22:42:00Z

Thank you! Do you have any ideas on how to effectively parse it programmatically as an iterator in Python? Perhaps I can read line by line on stdin and keep track of the read name? Or do you have a better idea? Billy

…

On Tue, Jun 4, 2024 at 3:19 PM Arthur Rand ***@***.***> wrote: Hello @billytcl <https://github.com/billytcl>, Please find attached a build that allows the following command that will stream the "read calls" table to stdout: $ modkit extract ${modbam} null --read-calls stdout [--pass] [--no-header] Perhaps self-explanatory but --pass will restrict the read calls table to only "passing" (i.e. thresholded) calls and --no-header will omit the first header line (sometimes this makes parsing easier). These changes will make it into the release later this week as well. — Reply to this email directly, view it on GitHub <#194 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACPHYT5664KYYP2I6I3VKJLZFY4O3AVCNFSM6AAAAABITOUZDKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNBYGUYDINZVGA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

ArtRand · 2024-06-04T22:56:21Z

@billytcl

That's basically what I do, aggregate the lines together until the read_id changes. If I'm doing an expensive computation on the batch of lines, I send that batch on a channel to another thread pool to process in parallel, then send the results to an aggregator. Feel free to email me art.rand[at]nanoporetech.com.

[extract] Allows read calls to be streamed to stdout. See merge request machine-learning/modkit!187

ArtRand · 2024-06-11T15:54:12Z

@billytcl this change has been released in v0.3.1rc1

ArtRand added the enhancement New feature or request label Jun 3, 2024

ArtRand added the build-available custom build produced for fix. label Jun 4, 2024

ArtRand added a commit that referenced this issue Jun 8, 2024

Merge branch 'ar/gh-194' into 'master'

29e69be

[extract] Allows read calls to be streamed to stdout. See merge request machine-learning/modkit!187

ArtRand closed this as completed Jun 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API for modkit extract read_calls_path output? #194

API for modkit extract read_calls_path output? #194

billytcl commented May 31, 2024

ArtRand commented Jun 3, 2024

billytcl commented Jun 3, 2024 via email

ArtRand commented Jun 3, 2024

billytcl commented Jun 3, 2024 via email

ArtRand commented Jun 4, 2024 •

edited

Loading

billytcl commented Jun 4, 2024 via email

ArtRand commented Jun 4, 2024 •

edited

Loading

ArtRand commented Jun 11, 2024

API for modkit extract read_calls_path output? #194

API for modkit extract read_calls_path output? #194

Comments

billytcl commented May 31, 2024

ArtRand commented Jun 3, 2024

billytcl commented Jun 3, 2024 via email

ArtRand commented Jun 3, 2024

billytcl commented Jun 3, 2024 via email

ArtRand commented Jun 4, 2024 • edited Loading

billytcl commented Jun 4, 2024 via email

ArtRand commented Jun 4, 2024 • edited Loading

ArtRand commented Jun 11, 2024

ArtRand commented Jun 4, 2024 •

edited

Loading

ArtRand commented Jun 4, 2024 •

edited

Loading