
[Datasets] Autodetect dataset parallelism based on available resources and data size #25883

Merged · 61 commits into ray-project:master · Jul 13, 2022

Conversation

@ericl (Contributor) commented Jun 17, 2022

Why are these changes needed?

This PR defaults the parallelism of Dataset reads to -1. In that case, the parallelism is determined by the following rules:

  • The number of available CPUs is estimated. If running in a placement group, the cluster's CPU count is scaled by the placement group's share of the cluster; if not, the total number of CPUs in the cluster is used. If the estimate is less than 8, it is raised to 8.
  • The parallelism is set to twice the estimated number of CPUs.
  • The in-memory data size is estimated. If this parallelism would create in-memory blocks larger than the target block size (512MiB), the parallelism is increased until each block is under 512MiB (see the sketch after this list).
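
For reference, here is a minimal sketch of this heuristic in Python. The function and constant names are illustrative, not the exact identifiers used in the Ray codebase:

```python
import math
from typing import Optional

# Illustrative constants matching the rules above (assumed names).
MIN_ESTIMATED_CPUS = 8
TARGET_MAX_BLOCK_SIZE = 512 * 1024 * 1024  # 512MiB

def autodetect_parallelism(estimated_cpus: int,
                           in_memory_size: Optional[int]) -> int:
    # Rule 1: clamp the CPU estimate to at least 8.
    cpus = max(estimated_cpus, MIN_ESTIMATED_CPUS)
    # Rule 2: parallelism is twice the estimated CPU count.
    parallelism = cpus * 2
    # Rule 3: raise parallelism until blocks fall under the 512MiB target.
    if in_memory_size is not None:
        parallelism = max(parallelism,
                          math.ceil(in_memory_size / TARGET_MAX_BLOCK_SIZE))
    return parallelism
```

With this default, a plain read such as `ray.data.read_parquet(path)` picks its parallelism this way rather than via the previous fixed default of 200.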

These rules fix two common user problems:

  1. Insufficient parallelism in a large cluster, or too much parallelism on a small cluster.
  2. Overly large block sizes leading to OOMs when processing a single block.

TODO:

  • [x] Unit tests
  • [x] Docs update

Supersedes part of: #25708

@ericl changed the title from "[WIP] Autodetect dataset parallelism based on available resources" to "[WIP] Autodetect dataset parallelism based on available resources and data size" on Jun 17, 2022
@ericl (Contributor, Author) commented Jul 13, 2022

test_tensors_shuffle failing

@ericl merged commit 9de1add into ray-project:master on Jul 13, 2022
```python
self._columns = columns
self._schema = schema

def estimate_inmemory_data_size(self) -> Optional[int]:
```
A Contributor commented on this diff:
Sorry for the late comment, but I think it's probably a bug here to rely on serialized_size, which seems to be the size of the file footer, not the size of the actual data. I crafted a quick fix in #26516; please let me know if it makes sense, thanks.
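
To illustrate the distinction the comment draws, here is a quick pyarrow sketch (illustrative only; the actual fix lives in #26516):

```python
import pyarrow.parquet as pq

md = pq.ParquetFile("example.parquet").metadata  # hypothetical file

# serialized_size is the size of the Thrift-encoded file footer,
# not the size of the data itself -- hence the suspected bug.
footer_bytes = md.serialized_size

# Summing total_byte_size across row groups estimates the uncompressed
# data size instead.
data_bytes = sum(
    md.row_group(i).total_byte_size for i in range(md.num_row_groups)
)
```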

ericl added a commit that referenced this pull request Jul 14, 2022
… for parallelism detection (#26543)

In the previous PR #25883, a subtle regression was introduced for cases where data sizes blow up significantly after reading.

For example, suppose you're reading JPEG image files into a Dataset, which grow substantially in size on decompression. On a small-core cluster (e.g., 4 cores), reading a 1GiB dataset yields 4-8 blocks of ~200MiB each. These can blow up and OOM the node when decompressed (e.g., a 25x size increase).

Previously, the fixed parallelism=200 heuristic avoided this small-node problem. This PR avoids the issue by raising the minimum parallelism back to 200. As an optimization, it also introduces a minimum block size threshold, which allows using fewer blocks when the data size is very small (<100KiB per block); see the sketch below.
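
A minimal sketch of the revised heuristic, assuming illustrative constant names and the 100KiB minimum block size described above:

```python
import math

MIN_PARALLELISM = 200                      # restored floor
TARGET_MAX_BLOCK_SIZE = 512 * 1024 * 1024  # 512MiB
TARGET_MIN_BLOCK_SIZE = 100 * 1024         # 100KiB

def autodetect_parallelism(num_cpus: int, data_size: int) -> int:
    # Floor at 200, scale with CPUs, and split further if blocks would
    # exceed the 512MiB target.
    parallelism = max(MIN_PARALLELISM, num_cpus * 2,
                      math.ceil(data_size / TARGET_MAX_BLOCK_SIZE))
    # Optimization: if that would yield blocks under ~100KiB, use fewer.
    max_blocks = max(1, data_size // TARGET_MIN_BLOCK_SIZE)
    return min(parallelism, max_blocks)
```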
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022 (ray-project#26543)
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022 (ray-project#25883)
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022 (ray-project#26543)