Add multi-file support to `dask_cudf.read_json` #16057

rjzamora · 2024-06-18T18:51:40Z

Description

Dask cuDF often benefits from larger partition sizes than pandas-backed Dask DataFrame. This motivates the ability to easily "aggregate" multiple JSON files into each partition using `dask_cudf.read_json`. This PR introduces the `aggregate_files` argument (defaults to `True`) to make multi-file DataFrame partitions easier to create.
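To make the idea of "aggregating" files concrete, here is a minimal pure-Python sketch of one way to pack several small files into each partition. It is an illustration only and is not the code added by this PR: the helper name `group_files_by_size`, the `target_bytes` parameter, and the file sizes are all hypothetical, and the greedy packing rule is an assumption chosen for simplicity.

```python
from typing import List, Sequence, Tuple


def group_files_by_size(
    files: Sequence[Tuple[str, int]], target_bytes: int
) -> List[List[str]]:
    """Greedily pack consecutive (path, size) pairs into partitions.

    Hypothetical helper: a new partition starts whenever adding the next
    file would push the running total past ``target_bytes``. A single file
    larger than the target still gets a partition of its own.
    """
    partitions: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for path, size in files:
        # Close the current partition if this file would overflow it.
        if current and current_bytes + size > target_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions


# Made-up file names and sizes (in bytes), for illustration only.
files = [
    ("a.json", 40),
    ("b.json", 30),
    ("c.json", 50),
    ("d.json", 120),
    ("e.json", 10),
]
partitions = group_files_by_size(files, target_bytes=100)
print(partitions)
```

Each inner list stands for the files that would be read into one DataFrame partition. In Dask cuDF itself, this behavior is requested through the `aggregate_files` argument of `dask_cudf.read_json` described above; how that argument chooses the grouping may differ from this sketch.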

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

rjzamora · 2024-06-18T18:58:43Z

cc @randerzander @ayushdg

python/dask_cudf/dask_cudf/io/tests/test_json.py

python/dask_cudf/dask_cudf/io/json.py


pentschev

Left a few comments, overall it looks fine to me, but I'm not very knowledgeable of the dask-cudf codebase to offer the most insightful feedback.

python/dask_cudf/dask_cudf/backends.py

python/dask_cudf/dask_cudf/io/tests/test_json.py

python/dask_cudf/dask_cudf/io/json.py


pentschev

LGTM, once again I'm not very knowledgeable of dask-cudf so I'll leave it up to you @rjzamora to merge it or wait for other reviews. 🙂

rjzamora · 2024-07-15T13:26:20Z

/merge

add basic aggregate_files option

49fa45c

rjzamora added the 2 - In Progress, dask, and improvement labels Jun 18, 2024

rjzamora self-assigned this Jun 18, 2024

github-actions bot added the Python label Jun 18, 2024

rjzamora added the non-breaking label Jun 18, 2024

Merge branch 'branch-24.08' into multi-file-json

c30583d

rjzamora commented Jun 18, 2024

python/dask_cudf/dask_cudf/io/tests/test_json.py (outdated, resolved)

Update python/dask_cudf/dask_cudf/io/tests/test_json.py

5393d76

rjzamora commented Jun 18, 2024

python/dask_cudf/dask_cudf/io/json.py (outdated, resolved)

rjzamora added 4 commits June 27, 2024 09:14

Merge branch 'branch-24.08' into multi-file-json

d1fea99

Merge remote-tracking branch 'upstream/branch-24.08' into multi-file-…

b494d3a

…json

add support for include_path_column

a3f08d9

add include_path_column support and test coverage

83d1f0a

rjzamora marked this pull request as ready for review July 5, 2024 19:34

rjzamora requested a review from a team as a code owner July 5, 2024 19:34

rjzamora changed the title ~~[WIP] Add multi-file support to dask_cudf.read_json~~ Add multi-file support to dask_cudf.read_json Jul 5, 2024

Merge branch 'branch-24.08' into multi-file-json

2727f66

pentschev reviewed Jul 9, 2024

python/dask_cudf/dask_cudf/backends.py (outdated, resolved)

python/dask_cudf/dask_cudf/io/tests/test_json.py (outdated, resolved)

python/dask_cudf/dask_cudf/io/json.py (outdated, resolved)

rjzamora added 3 commits July 9, 2024 08:29

Merge remote-tracking branch 'upstream/branch-24.08' into multi-file-…

7f3925d

…json

address code review

a87d81c

fix typo

4cee453

pentschev approved these changes Jul 9, 2024

rjzamora added 3 commits July 12, 2024 12:05

Merge branch 'branch-24.08' into multi-file-json

aa3a595

Merge branch 'branch-24.08' into multi-file-json

90f31b6

Merge branch 'branch-24.08' into multi-file-json

2797816

rjzamora added the 5 - Ready to Merge label and removed the 2 - In Progress label Jul 15, 2024

rapids-bot bot merged commit c4ee4a7 into rapidsai:branch-24.08 Jul 15, 2024
79 checks passed

rjzamora deleted the multi-file-json branch July 15, 2024 13:26
