[WIP][RFC] Add sparse host buffer source #16252

rjzamora · 2024-07-11T15:52:24Z

Related to #15919

DISCLAIMER: This is meant to be a rough RFC to illustrate my idea for a temporary mitigation strategy for NativeFile removal. I am probably not the right person to take the libcudf changes over the line but I still threw some C++ code together for fun :)

Idea: We don't need NativeFile support to get reasonable partial-IO performance from remote storage, provided that (1) we use fsspec to transfer only the necessary byte ranges into host memory, and (2) libcudf is able to read from the resulting sparse <offset, byte-range> mapping. The fsspec component is covered in #16166, but that PR currently wastes a lot of host memory by copying the necessary byte ranges into a larger "proxy" buffer that matches the size of the actual file.
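To make the host-memory difference concrete, here is a minimal Python sketch (not the code from this PR or from #16166). The names `proxy_buffer` and `sparse_mapping`, and the toy file size and ranges, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def proxy_buffer(file_size: int, ranges: Dict[int, bytes]) -> bytearray:
    """Proxy approach in spirit: one buffer as large as the whole file,
    with each fetched range copied to its original offset."""
    buf = bytearray(file_size)  # zero-filled here; real code may leave gaps uninitialized
    for offset, data in ranges.items():
        buf[offset:offset + len(data)] = data
    return buf


def sparse_mapping(ranges: Dict[int, bytes]) -> List[Tuple[int, bytes]]:
    """Sparse approach: keep only (offset, bytes) pairs, sorted by offset."""
    return sorted(ranges.items())


# Toy "file" of 1 MB where only three small ranges are actually needed.
file_size = 1_000_000
ranges = {
    0: b"HEAD" * 4,               # 16 bytes at the start
    500_000: b"x" * 1024,         # 1024 bytes in the middle
    file_size - 8: b"FOOTER!!",   # 8 bytes at the end
}

proxy = proxy_buffer(file_size, ranges)
sparse = sparse_mapping(ranges)

proxy_bytes = len(proxy)                        # host bytes held by the proxy buffer
sparse_bytes = sum(len(d) for _, d in sparse)   # host bytes held by the sparse mapping
print(proxy_bytes, sparse_bytes)
```

In this toy case the proxy buffer holds the full file size in host memory, while the sparse mapping holds only the bytes that were transferred.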

@vuule @vyasr - Does a "sparse" host buffer source seem like a reasonable approach for the near term?

GregoryKimball · 2024-07-26T19:18:25Z

Thank you @rjzamora for suggesting this idea. I think a sparse host buffer data source could make sense in libcudf.

The sparse host buffer data source is meant to be an improvement over the current approach, which prepares a full-size host buffer data source for reading with only some bytes populated and the rest left uninitialized. The current approach carries a risk of data corruption if libcudf ever reads an uninitialized byte range, whether from a bug in read_parquet or an over-read in JSON/CSV byte range support. A sparse host buffer data source could instead throw if data were accessed outside a valid byte range. There could also be a performance benefit and a reduction in host memory pressure, because we could allocate smaller host buffers with less wasted space.
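The "throw on access outside a valid byte range" behavior can be sketched as follows. This is a Python model of the idea only, not the libcudf C++ prototype in this PR; the class name `SparseHostBuffer` and its `read` method are hypothetical, and the sketch assumes a read must fall entirely inside one populated range.

```python
import bisect
from typing import Dict


class SparseHostBuffer:
    """Hypothetical model: holds only the populated byte ranges of a file."""

    def __init__(self, ranges: Dict[int, bytes]):
        items = sorted(ranges.items())
        self._offsets = [offset for offset, _ in items]
        self._chunks = [data for _, data in items]

    def read(self, offset: int, size: int) -> bytes:
        """Return `size` bytes starting at file offset `offset`.

        Raises ValueError unless the request lies fully inside a single
        populated range (reads spanning two ranges are not handled here).
        """
        # Find the last populated range that starts at or before `offset`.
        i = bisect.bisect_right(self._offsets, offset) - 1
        if i >= 0:
            start = self._offsets[i]
            chunk = self._chunks[i]
            if offset + size <= start + len(chunk):
                rel = offset - start
                return chunk[rel:rel + size]
        raise ValueError(
            f"read of {size} bytes at offset {offset} is outside the populated ranges"
        )


buf = SparseHostBuffer({100: b"abcdef", 1000: b"footer"})
inside = buf.read(102, 3)      # fully inside the range starting at 100

try:
    buf.read(104, 4)           # runs past the end of that range
    over_read_raised = False
except ValueError:
    over_read_raised = True
```

With a full-size proxy buffer the second read would silently return whatever bytes happen to sit in the gap; here it fails loudly instead.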

Would you please help us measure the benefits of a sparse host buffer data source? For this test we could compare the read throughput of a 500 MB file versus a 5 GB file with only 500 MB populated.
We might want to separate a "single range" sparse host buffer from a "multi range" host buffer.
- A sparse host buffer representing a parquet file will have a footer byte range, and then byte ranges for each column chunk (+metadata). This sparse host buffer would have multiple ranges, and we would expect read_parquet to never read outside a populated byte range.
- A sparse host buffer representing a CSV/JSON file (I believe) must be constrained to a single byte range, and the range should be sized to allow over-reading to find the end of the last record. If we had multiple byte ranges in a single sparse host buffer, I'm not sure how that would interact with byte-range reads over the sparse host buffer... (but it would probably be a nightmare). It's likely that sparse host buffer data sources for text files will have a single range anyway.
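For the single-range text case, the over-read requirement might look like the sketch below. It assumes newline-terminated records and the convention that a record belongs to a byte range if its first byte falls inside `[start, end)`; that convention, the helper names `fetch_range` and `records_starting_in`, and the padding values are assumptions for illustration, not a description of libcudf's byte-range implementation.

```python
from typing import List, Tuple


def fetch_range(start: int, end: int, pad: int, file_size: int) -> Tuple[int, int]:
    """Bytes to populate for a logical range [start, end): one byte before
    `start` (to detect a record boundary) plus `pad` bytes of over-read."""
    return max(start - 1, 0), min(end + pad, file_size)


def records_starting_in(chunk: bytes, chunk_offset: int, start: int, end: int) -> List[bytes]:
    """Return the records whose first byte lies in [start, end).

    `chunk` holds file bytes [chunk_offset, chunk_offset + len(chunk)).
    Raises ValueError if a record runs past the populated bytes.
    """
    pos = start
    if start > 0:
        # The first owned record begins right after the first newline at or after start - 1.
        nl = chunk.find(b"\n", start - 1 - chunk_offset)
        if nl == -1:
            return []
        pos = chunk_offset + nl + 1
    records = []
    while pos < end:
        nl = chunk.find(b"\n", pos - chunk_offset)
        if nl == -1:
            raise ValueError("padding too small: record runs past the populated range")
        records.append(chunk[pos - chunk_offset:nl + 1])
        pos = chunk_offset + nl + 1
    return records


data = b"a,1\nbb,2\nccc,3\ndddd,4\n"  # stand-in for a remote CSV file


def read_range(start: int, end: int, pad: int) -> List[bytes]:
    lo, hi = fetch_range(start, end, pad, len(data))
    return records_starting_in(data[lo:hi], lo, start, end)


middle = read_range(5, 12, pad=8)   # the record that starts at byte 9 ends at byte 14, past `end`
```

If the padding is too small for the last record (for example `pad=1` for the same range), the sketch raises rather than returning a truncated record, which is the failure mode a sparse source would need to surface.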

rjzamora added 2 commits July 10, 2024 14:40

jot down rough prototype for sparse_host_buffer_source class

eea980d

add create def

5669ea6

rjzamora added the "2 - In Progress" (Currently a work in progress) and "improvement" (Improvement / enhancement to an existing function) labels Jul 11, 2024

github-actions bot added the "libcudf" label (Affects libcudf (C++/CUDA) code) Jul 11, 2024

rjzamora mentioned this pull request Jul 22, 2024

[WIP] Improve pyarrow-free remote-IO performance #16166

Draft

5 tasks
