
Glommio based IO version for Parquet #5240

Closed
ozgrakkurt opened this issue Dec 24, 2023 · 8 comments
Labels
development-process Related to development process of arrow-rs · enhancement Any new improvement worthy of an entry in the changelog

Comments

@ozgrakkurt

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I want to read parquet files with lower latency and better throughput. Glommio is a thread-per-core library that uses io_uring and direct IO. Direct IO is particularly nice for my case, and I imagine for many other parquet use cases, because my use case doesn't benefit at all from caching the parquet files in the Linux page cache; it even suffers from it. I also store the parquet files on fast NVMe disks, so direct IO and io_uring can give a big performance boost.

https://itnext.io/modern-storage-is-plenty-fast-it-is-the-apis-that-are-bad-6a68319fbc1a
https://www.phoronix.com/news/OpenZFS-DirectIO-Performance

Describe the solution you'd like
Have an alternative IO implementation for parquet, similar to the async feature (which uses tokio).

It would implement glommio_reader, glommio_writer etc., similar to async_reader, async_writer etc.

Describe alternatives you've considered
The file could be opened with the O_DIRECT flag and read with the current async or sync implementation, but then it crashes because of unaligned buffers etc. (see the sketch below).

Just using glommio should be better, since then we don't need to implement buffer alignment ourselves. Glommio also already has the io_uring, thread-per-core architecture, which is very nice for building database-like systems.
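
As an illustration of the alignment problem (not from the original discussion): with O_DIRECT, the kernel rejects reads whose buffer address, file offset, or length are not multiples of the logical block size. The file path and read size below are hypothetical.

```rust
use std::fs::OpenOptions;
use std::io::Read;
use std::os::unix::fs::OpenOptionsExt;

fn main() -> std::io::Result<()> {
    // Bypass the page cache entirely (Linux-only; uses the libc crate).
    let mut file = OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open("data.parquet")?;

    // With O_DIRECT, the buffer address, file offset, and length must all
    // be aligned to the logical block size (typically 512 B or 4 KiB).
    // A plain Vec gives no such guarantee, so this read normally fails
    // with EINVAL instead of returning data.
    let mut buf = vec![0u8; 1000];
    match file.read(&mut buf) {
        Ok(n) => println!("read {n} bytes (alignment happened to work out)"),
        Err(e) => println!("unaligned O_DIRECT read failed: {e}"),
    }
    Ok(())
}
```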

@ozgrakkurt ozgrakkurt added the enhancement label Dec 24, 2023
@tustvold
Contributor

The async reader and writer are designed to not be tightly coupled to tokio; can they be used for this?

@ozgrakkurt
Author

> The async reader and writer are designed to not be tightly coupled to tokio; can they be used for this?

So we could just add a glommio feature to the object_store crate, add a struct that implements the ObjectStore trait, and use it with the async_reader?

@tustvold
Contributor

I think ObjectStore would run into issues with Send bounds. I was thinking you could implement AsyncFileReader in your application using glommio. We may need to tweak the trait to make the futures associated types, so they can use LocalBoxFuture in this case, but I think it should be tractable.
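
To make that concrete, here is a rough sketch of what such an implementation could look like. It assumes a hypothetical Send-free variant of the trait (called LocalAsyncFileReader here; this is not part of the parquet crate) whose method returns a LocalBoxFuture, since glommio futures are !Send:

```rust
use std::ops::Range;
use std::rc::Rc;

use bytes::Bytes;
use futures::future::{FutureExt, LocalBoxFuture};
use glommio::io::DmaFile;

/// Hypothetical !Send analogue of parquet's AsyncFileReader, boxing its
/// futures as LocalBoxFuture instead of the Send-bounded BoxFuture.
trait LocalAsyncFileReader {
    fn get_bytes(&mut self, range: Range<usize>) -> LocalBoxFuture<'_, std::io::Result<Bytes>>;
}

struct GlommioFileReader {
    // DmaFile opens the file with O_DIRECT and handles alignment internally.
    file: Rc<DmaFile>,
}

impl LocalAsyncFileReader for GlommioFileReader {
    fn get_bytes(&mut self, range: Range<usize>) -> LocalBoxFuture<'_, std::io::Result<Bytes>> {
        let file = Rc::clone(&self.file);
        async move {
            let data = file
                .read_at(range.start as u64, range.end - range.start)
                .await
                .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e.to_string()))?;
            // ReadResult derefs to &[u8]; copy it into an owned Bytes buffer.
            Ok(Bytes::copy_from_slice(&data))
        }
        .boxed_local()
    }
}
```

This would have to be driven from inside a glommio executor, e.g. glommio::LocalExecutor::default().run(async { ... }).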

@tustvold
Contributor

FWIW I would be interested in any benchmarks of this; typically on modern hardware, parquet decoding ends up CPU-bound, not IO-bound, when reading from local disk.

@ozgrakkurt
Author

ozgrakkurt commented Dec 24, 2023

> FWIW I would be interested in any benchmarks of this; typically on modern hardware, parquet decoding ends up CPU-bound, not IO-bound, when reading from local disk.

I want to disable compression and use RLE/dictionary encoding for many of the columns we have, keeping LZ4 only for some. Theoretically it should be able to decode faster than the disk read, even on NVMe. Even if decoding can't quite keep up, this will be a big latency improvement, since it is not always possible to perfectly pipeline reading from disk and decoding.
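
For reference, that kind of writer configuration might look roughly like this with the parquet crate's WriterProperties (the "payload" column name is made up):

```rust
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;
use parquet::schema::types::ColumnPath;

fn main() {
    // Dictionary/RLE encoding everywhere, no compression by default,
    // LZ4 only on the hypothetical "payload" column.
    let _props = WriterProperties::builder()
        .set_dictionary_enabled(true)
        .set_compression(Compression::UNCOMPRESSED)
        .set_column_compression(ColumnPath::from("payload"), Compression::LZ4)
        .build();
}
```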

This will also remove the massive page-cache thrashing we have: we store terabytes of data, and since we use buffered IO, Linux constantly caches 80-100 GB of parquet data in RAM without any need for it. This is especially annoying in Kubernetes, which counts that cache as the program's actual memory usage.

I won't have super clean benchmarks, since I'll probably modify other things while making this change, but I'll share what I have 👍

@tustvold
Contributor

tustvold commented Dec 24, 2023

> Theoretically it should be able to decode faster than the disk read

Certainly decoding plain encoded primitives can approach the memory bandwidth of the machine. However, the same is not typically true of BYTE_ARRAY, where the data dependency introduced by the lengths, among other overheads, tends to get in the way. So it will likely depend on the nature of your data.
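
To make the length data dependency concrete, here is a simplified sketch of a PLAIN BYTE_ARRAY decoder (illustrative only, not the parquet crate's actual decoder). Each value is a 4-byte little-endian length followed by that many bytes, so the position of value N is only known after decoding every preceding length:

```rust
/// Simplified PLAIN BYTE_ARRAY decoder: every iteration must read the
/// previous length before it knows where the next value starts, a serial
/// dependency that fixed-width primitive decoding does not have.
fn decode_plain_byte_array(mut data: &[u8]) -> Vec<&[u8]> {
    let mut values = Vec::new();
    while data.len() >= 4 {
        let len = u32::from_le_bytes(data[..4].try_into().unwrap()) as usize;
        let (value, rest) = data[4..].split_at(len);
        values.push(value);
        data = rest;
    }
    values
}
```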

> I won't have super clean benchmarks, since I'll probably modify other things while making this change, but I'll share what I have

I'll be very interested to see any results you can share, NVMe opens a lot of cool possibilities 👍

@ozgrakkurt
Author

@tustvold so far I have an initial version of this working. The IO itself is 2x-3x as fast as tokio/stdlib IO, but decoding is the main bottleneck, like you said. I suspect I can tune decoding speed a lot; we currently plain-encode columns with many, many repetitions and then LZ4-compress them, which I'm guessing is not ideal.

@tustvold
Contributor

I'm going to close this as a duplicate of #4631

@tustvold tustvold closed this as not planned (duplicate) Mar 16, 2024
@tustvold tustvold added the development-process label Apr 17, 2024