[Data] Add `override_num_blocks` to `from_pandas` and perform auto-partition #44937

bveeramani · 2024-04-23T22:43:43Z

Why are these changes needed?

A common pattern is to load a DataFrame containing file URIs with `from_pandas` and then load those URIs with `map_batches`. If you have a single large DataFrame, the subsequent operator (e.g., the one that reads the files) won't be parallelized because `from_pandas` produces only one input block.

To fix this issue, this PR makes `from_pandas` automatically partition its input into a suitable number of blocks and adds an `override_num_blocks` parameter so that users can set the number of blocks explicitly.
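To make the parallelism problem concrete, here is a minimal sketch in plain Python. It is not Ray's implementation: a block is modeled as a list of rows, the helper names `choose_num_blocks` and `split_rows` are hypothetical, and both the size-based heuristic and the 128 MiB target used in the example are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def choose_num_blocks(
    total_bytes: int,
    target_block_bytes: int,
    override_num_blocks: Optional[int] = None,
) -> int:
    """Pick a block count: the user's override if given, else size-based."""
    if override_num_blocks is not None:
        if override_num_blocks < 1:
            raise ValueError("override_num_blocks must be at least 1")
        return override_num_blocks
    # Hypothetical heuristic: enough blocks that none exceeds the target size.
    return max(1, math.ceil(total_bytes / target_block_bytes))


def split_rows(rows: Sequence[T], num_blocks: int) -> List[List[T]]:
    """Split rows into num_blocks contiguous, nearly equal blocks."""
    base, extra = divmod(len(rows), num_blocks)
    blocks: List[List[T]] = []
    start = 0
    for i in range(num_blocks):
        # The first `extra` blocks take one additional row each.
        size = base + (1 if i < extra else 0)
        blocks.append(list(rows[start:start + size]))
        start += size
    return blocks


TARGET = 128 * 1024 * 1024  # assumed target block size, for illustration only
uris = [f"s3://bucket/file-{i}.png" for i in range(10)]  # made-up URIs

# A small table fits in one block, so a downstream step gets a single task.
single = split_rows(uris, choose_num_blocks(1_000, TARGET))

# An explicit override yields four blocks that can be processed in parallel.
parallel = split_rows(uris, choose_num_blocks(1_000, TARGET, override_num_blocks=4))
```

In this sketch, a table of URIs is tiny in bytes even when each URI points at a large file, which is why a purely size-based count can still leave the read step with one block and why an explicit override is useful.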

Related issue number

Fixes #44893

Checks

- I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in `doc/source/tune/api/` under the
    corresponding `.rst` file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Balaji Veeramani <[email protected]>

…rtition (ray-project#44937) A common pattern is to load a DataFrame containing file URIs with from_pandas and then loading those URIs with map_batches. If you have a single large DataFrame, the subsequent operator (e.g., for reading) won't be parallelized because from_pandas produces one input block. To fix this issue, this PR automatically splits DataFrames into a good number of blocks, and allows the user to override the number of blocks. Signed-off-by: Balaji Veeramani <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]>

Originally, the number of blocks output by `from_pandas` equaled the number of input DataFrames (i.e., each input DataFrame became a block). For consistency with how we treat other inputs, #44937 changed the behavior so that each output block is sized to the target block size. This meant that you could pass in many DataFrames as input but `from_pandas` would output only one block. The change is problematic because many users do something like `from_pandas(np.array_split(metadata, num_blocks))` to get better performance, and after #44937, the `array_split` is pointless. So, this PR reverts the change.

Signed-off-by: Balaji Veeramani <[email protected]>
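The two behaviors contrasted in this commit message can be sketched in plain Python. This is an illustrative model under stated assumptions, not Ray's code: lists of rows stand in for DataFrames, a row count stands in for the byte-based target block size, and the function names `one_block_per_input` and `blocks_by_target_size` are hypothetical.

```python
from typing import List, Sequence

Row = str


def one_block_per_input(inputs: Sequence[Sequence[Row]]) -> List[List[Row]]:
    """Original behavior: each input DataFrame becomes its own block."""
    return [list(chunk) for chunk in inputs]


def blocks_by_target_size(
    inputs: Sequence[Sequence[Row]], target_rows_per_block: int
) -> List[List[Row]]:
    """Size-based behavior: pool all rows, then cut blocks of the target size."""
    rows = [row for chunk in inputs for row in chunk]
    return [
        rows[i : i + target_rows_per_block]
        for i in range(0, len(rows), target_rows_per_block)
    ]


metadata = [f"file-{i}.parquet" for i in range(8)]  # made-up file names

# The user pre-splits the metadata into four chunks to get four parallel tasks.
pre_split = [metadata[i : i + 2] for i in range(0, 8, 2)]

# One block per input preserves the user's four-way split.
kept = one_block_per_input(pre_split)

# Sizing by a large target merges the small chunks back into a single block.
merged = blocks_by_target_size(pre_split, target_rows_per_block=1_000)
```

Under these assumptions, the pre-split input survives only in the first model, which matches the concern that a manual `np.array_split` has no effect once blocks are sized by target.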

Initial commit

b1e0c7b

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani requested review from ericl, scv119, c21, amogkam, scottjlee, raulchen, stephanie-wang and omatthew98 as code owners April 23, 2024 22:43

bveeramani added 2 commits April 23, 2024 15:48

Add comments

915f65d

Signed-off-by: Balaji Veeramani <[email protected]>

Update comments

a68543e

Signed-off-by: Balaji Veeramani <[email protected]>

raulchen approved these changes Apr 24, 2024

bveeramani changed the title ~~[Data] Add override_num_blocks to from_pandas~~ [Data] Add override_num_blocks to from_pandas and perform auto-partition May 3, 2024

bveeramani added 8 commits May 6, 2024 13:21

Merge branch 'master' into pandas-split

48b5466

Signed-off-by: Balaji Veeramani <[email protected]>

Fix tests

cf322c0

Signed-off-by: Balaji Veeramani <[email protected]>

Appease lint

c532e5d

Signed-off-by: Balaji Veeramani <[email protected]>

Fix tests

df2ad8f

Signed-off-by: Balaji Veeramani <[email protected]>

Fix failing test

f8caff9

Signed-off-by: Balaji Veeramani <[email protected]>

Fix test

f5c4fb4

Signed-off-by: Balaji Veeramani <[email protected]>

Fix tests

ed86274

Signed-off-by: Balaji Veeramani <[email protected]>

Merge branch 'master' into pandas-split

dac8bc7

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani enabled auto-merge (squash) May 24, 2024 18:46

github-actions bot added the `go` label (add ONLY when ready to merge, run all tests) May 24, 2024

Update tests

00dfa45

Signed-off-by: Balaji Veeramani <[email protected]>

github-actions bot disabled auto-merge May 24, 2024 21:36

Fix stuff

52481a3

Signed-off-by: Balaji Veeramani <[email protected]>

bveeramani merged commit f13d144 into ray-project:master May 25, 2024
6 checks passed

bveeramani deleted the pandas-split branch May 25, 2024 07:43

bveeramani mentioned this pull request May 25, 2024

[Data] Add override_num_blocks parameter to from_pandas #44893

Closed

bveeramani mentioned this pull request Jul 1, 2024

[Data] Prevent from_pandas from combining input blocks #46363

Merged

8 tasks
