[Data] Remove read parallelism from Ray Data documentation #43690
Conversation
python/ray/data/context.py (Outdated)

    # NOTE: This parameter is deprecated. Use `min_num_blocks` instead.
    min_parallelism=DEFAULT_MIN_NUM_BLOCKS,
Is `DataContext.min_parallelism` used anywhere after these changes? If not, should we just remove it?
`DataContext` is a developer API, so I'm afraid users may be relying on it; that's why I didn't remove it in the first place. Maybe let's remove it in 2.11?
If we're keeping it, let's make it backwards compatible and raise a deprecation warning? With the current implementation, it seems like it's a no-op and we don't tell the user that.
Users can set the config directly, like:

    DataContext.get_current().min_parallelism = ...

It seems to me there's no easy way to throw a deprecation warning here. WDYT?
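For what it's worth, one place a warning *could* be raised for direct assignment is a property setter. A minimal sketch with a hypothetical stand-in class (not Ray's actual `DataContext` implementation):

```python
import warnings

DEFAULT_MIN_PARALLELISM = 200


class FakeDataContext:
    """Hypothetical stand-in for DataContext; illustration only."""

    def __init__(self):
        self._min_parallelism = DEFAULT_MIN_PARALLELISM

    @property
    def min_parallelism(self):
        return self._min_parallelism

    @min_parallelism.setter
    def min_parallelism(self, value):
        # Warn at assignment time, so `ctx.min_parallelism = ...` is caught.
        warnings.warn(
            "`min_parallelism` is deprecated; use `read_op_min_num_blocks`.",
            DeprecationWarning,
        )
        self._min_parallelism = value


ctx = FakeDataContext()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ctx.min_parallelism = 400
```

The trade-off is that a property complicates a plain config object, which may be why warning at the point of use was preferred instead.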
Discussed offline; added a deprecation warning in `_autodetect_parallelism` if `DataContext.min_parallelism` is set to a non-default value.
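Roughly, the check described above could look like this (hypothetical function name; the real change lives in `_autodetect_parallelism`):

```python
import warnings

DEFAULT_MIN_PARALLELISM = 200


def autodetect_parallelism_sketch(min_parallelism):
    # Warn only when the deprecated field differs from its default,
    # since that is the only signal that the user actually set it.
    if min_parallelism != DEFAULT_MIN_PARALLELISM:
        warnings.warn(
            "`DataContext.min_parallelism` is deprecated in Ray 2.10. "
            "Please specify `DataContext.read_op_min_num_blocks` instead.",
            DeprecationWarning,
        )
        return min_parallelism
    return DEFAULT_MIN_PARALLELISM


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = autodetect_parallelism_sketch(400)
```

A known limitation of this pattern: a user who explicitly sets the field back to the default value gets no warning.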
python/ray/data/context.py (Outdated)

    # block size config takes precedence over this.
    DEFAULT_MIN_PARALLELISM = 200
    DEFAULT_MIN_NUM_BLOCKS = 200
Not an issue, but these changes will likely conflict with #43578. If this PR merges first, I'll update the other PR accordingly.
Cool, thanks for the heads up. If #43578 merges first, I'll update as well.
python/ray/data/context.py (Outdated)

    @@ -259,6 +259,8 @@ def __init__(
        self.op_resource_reservation_enabled = DEFAULT_ENABLE_OP_RESOURCE_RESERVATION
        # The reservation ratio for ReservationOpResourceAllocator.
        self.op_resource_reservation_ratio = DEFAULT_OP_RESOURCE_RESERVATION_RATIO
        # Minimum number of output blocks for a dataset.
        self.min_num_blocks = DEFAULT_MIN_NUM_BLOCKS
This name is a bit confusing here; something like `read_op_min_num_blocks` may be better.
updated.
LGTM, some small nits
doc/source/data/loading-data.rst (Outdated)

    However, you can also override the default value by setting the ``override_num_blocks``
    argument. Ray Data decides internally how many read tasks to run concurrently to best
    utilize the cluster, ranging from ``1...override_num_blocks`` tasks. In other words,
    the higher the ``override_num_blocks``, the smaller the data blocks in the Dataset and
    hence the more opportunity for parallel execution.
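To illustrate the trade-off this doc paragraph describes, a toy back-of-the-envelope calculation (numbers are made up; actual block sizing in Ray Data also depends on the target block size settings):

```python
def approx_block_size(dataset_bytes, override_num_blocks):
    """Higher override_num_blocks -> smaller blocks -> more parallelism."""
    return dataset_bytes // override_num_blocks


# Compare two splits of a 10 GiB dataset.
ten_gib = 10 * 1024**3
coarse = approx_block_size(ten_gib, 200)    # default-ish block count
fine = approx_block_size(ten_gib, 2000)     # 10x more, smaller blocks
```

More blocks means more schedulable units, at the cost of more per-task overhead, which is why the docs frame it as a tuning knob rather than a fixed rule.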
- hence the more opportunity for parallel execution.
+ hence more opportunities for parallel execution.
updated.
doc/source/data/loading-data.rst (Outdated)

    You can override the default parallelism by setting the ``parallelism`` argument. For
    more information on how to tune the read parallelism, see
    :ref:`Advanced: Performance Tips and Tuning <data_performance_tips>`.
    For more information on how to tune the number of output blocks, see
- For more information on how to tune the number of output blocks, see
+ For more information on how to tune the number of output blocks, see
updated.
doc/source/data/performance-tips.rst (Outdated)

    - Tuning read parallelism
    + Tuning output blocks for read
    ~~~~~~~~~~~~~~~~~~~~~~~
may need to add more ~ to appease lint
updated.
doc/source/data/performance-tips.rst (Outdated)

    The ``override_num_blocks`` parameter passed to Ray Data's :ref:`read APIs <input-output>` specifies the number of output blocks and read tasks to create.
    Usually, if the read is followed by a :func:`~ray.data.Dataset.map` or :func:`~ray.data.Dataset.map_batches`, the map is fused with the read; therefore ``override_num_blocks`` also determines the number of map tasks.
- The ``override_num_blocks`` parameter passed to Ray Data's :ref:`read APIs <input-output>` specifies the number of output blocks and read tasks to create.
- Usually, if the read is followed by a :func:`~ray.data.Dataset.map` or :func:`~ray.data.Dataset.map_batches`, the map is fused with the read; therefore ``override_num_blocks`` also determines the number of map tasks.
+ - The ``override_num_blocks`` parameter passed to Ray Data's :ref:`read APIs <input-output>` specifies the number of output blocks, which is equivalent to the number of read tasks to create.
+ - Usually, if the read is followed by a :func:`~ray.data.Dataset.map` or :func:`~ray.data.Dataset.map_batches`, the map is fused with the read; therefore ``override_num_blocks`` also determines the number of map tasks.
updated.
doc/source/data/performance-tips.rst (Outdated)

    - To turn off this behavior and allow the read and map operators to be fused, set ``parallelism`` manually.
    - For example, this code sets ``parallelism`` to equal the number of files:
    + To turn off this behavior and allow the read and map operators to be fused, set ``override_num_blocks`` manually.
    + For example, this code sets ``override_num_blocks`` to equal the number of files:
- For example, this code sets ``override_num_blocks`` to equal the number of files:
+ For example, this code sets the number of files equal to ``override_num_blocks``:
updated.
Signed-off-by: Cheng Su <[email protected]>
    "``DataContext.min_parallelism`` is deprecated in Ray 2.10. "
    "Please specify ``DataContext.read_op_min_num_blocks`` instead."
)
Should we set `ctx.read_op_min_num_blocks = ctx.min_parallelism` so that `DataContext.min_parallelism` continues to work?
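The suggested backward-compat forwarding could be sketched like this (assumed names and defaults for illustration, not the merged code):

```python
import warnings

DEFAULT_MIN_PARALLELISM = 200


class Ctx:
    """Hypothetical stand-in for DataContext."""

    def __init__(self):
        self.min_parallelism = DEFAULT_MIN_PARALLELISM
        self.read_op_min_num_blocks = DEFAULT_MIN_PARALLELISM


def apply_deprecated_min_parallelism(ctx):
    # If the user set the deprecated field, warn and forward the value
    # to the new field so existing configs keep working.
    if ctx.min_parallelism != DEFAULT_MIN_PARALLELISM:
        warnings.warn(
            "`DataContext.min_parallelism` is deprecated in Ray 2.10. "
            "Please specify `DataContext.read_op_min_num_blocks` instead.",
            DeprecationWarning,
        )
        ctx.read_op_min_num_blocks = ctx.min_parallelism


ctx = Ctx()
ctx.min_parallelism = 500
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    apply_deprecated_min_parallelism(ctx)
```

This keeps the old knob working for one release while steering users to the new name, which matches the deprecate-then-remove plan discussed earlier in the thread.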
Yes, sorry, I missed that part.
Updated.
Signed-off-by: Cheng Su <[email protected]>
…ct#43690) This PR removes read parallelism from the Ray Data documentation, replacing it with the number of output blocks for read. The motivation is that we already deprecated `parallelism` in favor of `override_num_blocks` for read APIs. Signed-off-by: Cheng Su <[email protected]>
Why are these changes needed?
This PR removes read parallelism from the Ray Data documentation, replacing it with the number of output blocks for read. The motivation is that we already deprecated `parallelism` in favor of `override_num_blocks` for read APIs.
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added an API method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.