
[Data] Remove read parallelism from Ray Data documentation #43690

Merged
merged 5 commits into ray-project:master on Mar 6, 2024

Conversation

@c21 (Contributor) commented Mar 4, 2024

Why are these changes needed?

This PR removes read parallelism from the Ray Data documentation and replaces it with the number of output blocks for reads. The motivation is that `parallelism` is already deprecated in favor of `override_num_blocks` for the read APIs.
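
For illustration, the user-facing change being documented looks roughly like this (the S3 path and block count are placeholders):

import ray

# Before (deprecated): request roughly 100 read tasks via `parallelism`.
ds = ray.data.read_parquet("s3://my-bucket/data/", parallelism=100)

# After: request roughly 100 output blocks via `override_num_blocks`.
ds = ray.data.read_parquet("s3://my-bucket/data/", override_num_blocks=100)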

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines 297 to 298
# NOTE: This parameter is deprecated. Use `min_num_blocks` instead.
min_parallelism=DEFAULT_MIN_NUM_BLOCKS,
Member

Is DataContext.min_parallelism used anywhere after these changes? If not, should we just remove it?

Contributor Author

DataContext is a developer API, so I'm afraid users may already rely on it; that's why I didn't remove it in the first place. Maybe let's remove it in 2.11?

@bveeramani (Member) commented Mar 4, 2024

If we're keeping it, let's make it backward compatible and raise a deprecation warning? With the current implementation, it seems like setting it is a no-op and we don't tell the user that.

Contributor Author

Users can set the config directly, like:

DataContext.get_current().min_parallelism = ...

It seems to me there's no easy way to throw a deprecation warning here. WDYT?

Contributor Author

Discussed offline; added a deprecation warning in _autodetect_parallelism if DataContext.min_parallelism is set to a non-default value.
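
For reference, a minimal sketch of that kind of check (illustrative only, not the exact Ray source; the helper name and default constant are assumptions):

import warnings

DEFAULT_MIN_PARALLELISM = 200  # assumed to match the default quoted below

def _warn_if_min_parallelism_overridden(ctx):
    # Warn only when the deprecated field was changed from its default,
    # mirroring the approach described in this thread.
    if ctx.min_parallelism != DEFAULT_MIN_PARALLELISM:
        warnings.warn(
            "DataContext.min_parallelism is deprecated in Ray 2.10. "
            "Please specify DataContext.read_op_min_num_blocks instead.",
            DeprecationWarning,
        )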

# block size config takes precedence over this.
DEFAULT_MIN_PARALLELISM = 200
DEFAULT_MIN_NUM_BLOCKS = 200
Member

Not an issue, but these changes will likely conflict with #43578. If this PR merges first, I'll update the other PR accordingly.

Contributor Author

Cool, thanks for the heads up. If #43578 merges first, I'll update as well.

@@ -259,6 +259,8 @@ def __init__(
self.op_resource_reservation_enabled = DEFAULT_ENABLE_OP_RESOURCE_RESERVATION
# The reservation ratio for ReservationOpResourceAllocator.
self.op_resource_reservation_ratio = DEFAULT_OP_RESOURCE_RESERVATION_RATIO
# Minimum number of output blocks for a dataset.
self.min_num_blocks = DEFAULT_MIN_NUM_BLOCKS
Contributor

This name is a bit confusing here; something like "read_op_min_num_blocks" may be better.

Contributor Author

updated.
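
For reference, setting the renamed field from user code would look like this (the value 400 is just an example; the default is 200 per the constants quoted earlier):

import ray

ctx = ray.data.DataContext.get_current()
# Minimum number of output blocks for a dataset (used by auto-detection).
ctx.read_op_min_num_blocks = 400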

@scottjlee (Contributor) left a comment

LGTM, some small nits

However, you can also override the default value by setting the ``override_num_blocks``
argument. Ray Data decides internally how many read tasks to run concurrently to best
utilize the cluster, ranging from ``1...override_num_blocks`` tasks. In other words,
the higher the ``override_num_blocks``, the smaller the data blocks in the Dataset and
hence the more opportunity for parallel execution.
Contributor

Suggested change
hence the more opportunity for parallel execution.
hence more opportunities for parallel execution.

Contributor Author

updated.
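
For context, a quick illustration of the behavior that paragraph describes (using ray.data.range as a stand-in read API; the block counts are arbitrary examples):

import ray

# Default: Ray Data picks the number of output blocks automatically.
ds = ray.data.range(100_000)

# Override: request more (and therefore smaller) blocks, giving Ray Data
# up to 1000 read tasks that it can schedule in parallel.
ds = ray.data.range(100_000, override_num_blocks=1000)
print(ds.materialize().num_blocks())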

You can override the default parallelism by setting the ``parallelism`` argument. For
more information on how to tune the read parallelism, see
:ref:`Advanced: Performance Tips and Tuning <data_performance_tips>`.
For more information on how to tune the number of output blocks, see
Contributor

Suggested change
For more information on how to tune the number of output blocks, see
For more information on how to tune the number of output blocks, see

Contributor Author

updated.


Tuning read parallelism
Tuning output blocks for read
~~~~~~~~~~~~~~~~~~~~~~~
Contributor

may need to add more ~ to appease lint

Contributor Author

updated.

Comment on lines 28 to 29
The ``override_num_blocks`` parameter passed to Ray Data's :ref:`read APIs <input-output>` specifies the number of output blocks and read tasks to create.
Usually, if the read is followed by a :func:`~ray.data.Dataset.map` or :func:`~ray.data.Dataset.map_batches`, the map is fused with the read; therefore ``override_num_blocks`` also determines the number of map tasks.
Contributor

Suggested change
The ``override_num_blocks`` parameter passed to Ray Data's :ref:`read APIs <input-output>` specifies the number of output blocks and read tasks to create.
Usually, if the read is followed by a :func:`~ray.data.Dataset.map` or :func:`~ray.data.Dataset.map_batches`, the map is fused with the read; therefore ``override_num_blocks`` also determines the number of map tasks.
- The ``override_num_blocks`` parameter passed to Ray Data's :ref:`read APIs <input-output>` specifies the number of output blocks, which is equivalent to the number of read tasks to create.
- Usually, if the read is followed by a :func:`~ray.data.Dataset.map` or :func:`~ray.data.Dataset.map_batches`, the map is fused with the read; therefore ``override_num_blocks`` also determines the number of map tasks.

Contributor Author

updated.
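
For reference, a small sketch of the read/map fusion described in that snippet (the transform is a placeholder):

import ray

def add_double(batch):
    # `id` is the column produced by ray.data.range; batches arrive as dicts of arrays.
    batch["double"] = batch["id"] * 2
    return batch

# The map_batches stage fuses with the read, so both execute as roughly
# 200 tasks, one per requested output block.
ds = ray.data.range(100_000, override_num_blocks=200).map_batches(add_double)
ds.materialize()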

To turn off this behavior and allow the read and map operators to be fused, set ``parallelism`` manually.
For example, this code sets ``parallelism`` to equal the number of files:
To turn off this behavior and allow the read and map operators to be fused, set ``override_num_blocks`` manually.
For example, this code sets ``override_num_blocks`` to equal the number of files:
Contributor

Suggested change
For example, this code sets ``override_num_blocks`` to equal the number of files:
For example, this code sets the number of files equal to ``override_num_blocks``:

Contributor Author

updated.
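
As an illustration of that pattern (the file listing is a placeholder):

import ray

# Placeholder paths; in practice these would be the real input files.
paths = [f"s3://my-bucket/data/part-{i}.parquet" for i in range(16)]

# One output block, and hence one read task, per input file.
ds = ray.data.read_parquet(paths, override_num_blocks=len(paths))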

"``DataContext.min_parallelism`` is deprecated in Ray 2.10. "
"Please specify ``DataContext.read_op_min_num_blocks`` instead."
)

Member

Should we set ctx.read_op_min_num_blocks = ctx.min_parallelism so that DataContext.min_parallelism continues to work?

Contributor Author

Yes, sorry, I missed that part.

Contributor Author

Updated.
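
A minimal sketch of that fallback, assuming both fields are present on DataContext during the deprecation window (illustrative, not the exact diff in this PR):

import ray

DEFAULT_MIN_PARALLELISM = 200  # assumed default for the deprecated field

ctx = ray.data.DataContext.get_current()
# If user code still sets the deprecated field, carry its value over to the
# new one so existing configurations keep taking effect.
if ctx.min_parallelism != DEFAULT_MIN_PARALLELISM:
    ctx.read_op_min_num_blocks = ctx.min_parallelism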

@c21 merged commit c176117 into ray-project:master on Mar 6, 2024
9 checks passed
@c21 deleted the parallelism-doc branch on March 6, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request on Jun 7, 2024:

[Data] Remove read parallelism from Ray Data documentation (ray-project#43690)

This PR removes read parallelism from the Ray Data documentation and replaces it with the number of output blocks for reads. The motivation is that `parallelism` is already deprecated in favor of `override_num_blocks` for the read APIs.

Signed-off-by: Cheng Su <[email protected]>
Labels
release-blocker P0 Issue that blocks the release

4 participants