
[Data] Remove read parallelism from Ray Data documentation #43690

Merged
merged 5 commits into ray-project:master on Mar 6, 2024

Conversation

@c21 (Contributor) commented Mar 4, 2024

Why are these changes needed?

This PR removes read parallelism from the Ray Data documentation and replaces it with the number of output blocks for reads. The motivation is that `parallelism` is already deprecated in favor of `override_num_blocks` for the read APIs.
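
For illustration, the user-facing change being documented looks roughly like this (the S3 path and block count are placeholders):

import ray

# Before (deprecated): request roughly 100 read tasks via `parallelism`.
ds = ray.data.read_parquet("s3://my-bucket/data/", parallelism=100)

# After: request roughly 100 output blocks via `override_num_blocks`.
ds = ray.data.read_parquet("s3://my-bucket/data/", override_num_blocks=100)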

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines 297 to 298
# NOTE: This parameter is deprecated. Use `min_num_blocks` instead.
min_parallelism=DEFAULT_MIN_NUM_BLOCKS,
Member

Is DataContext.min_parallelism used anywhere after these changes? If not, should we just remove it?

Contributor Author

DataContext is a developer API, so I'm afraid users may already rely on it; that's why I didn't remove it in the first place. Maybe let's remove it in 2.11?

@bveeramani (Member) commented Mar 4, 2024

If we're keeping it, let's make it backward compatible and raise a deprecation warning? With the current implementation, it seems like setting it is a no-op and we don't tell the user that.

Contributor Author

Users can set the config directly, like:

DataContext.get_current().min_parallelism = ...

It seems to me there's no easy way to throw a deprecation warning here. WDYT?

Contributor Author

Discussed offline; added a deprecation warning in _autodetect_parallelism if DataContext.min_parallelism is set to a non-default value.
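
For reference, a minimal sketch of that kind of check (illustrative only, not the exact Ray source; the helper name and default constant are assumptions):

import warnings

DEFAULT_MIN_PARALLELISM = 200  # assumed to match the default quoted below

def _warn_if_min_parallelism_overridden(ctx):
    # Warn only when the deprecated field was changed from its default,
    # mirroring the approach described in this thread.
    if ctx.min_parallelism != DEFAULT_MIN_PARALLELISM:
        warnings.warn(
            "DataContext.min_parallelism is deprecated in Ray 2.10. "
            "Please specify DataContext.read_op_min_num_blocks instead.",
            DeprecationWarning,
        )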

# block size config takes precedence over this.
DEFAULT_MIN_PARALLELISM = 200
DEFAULT_MIN_NUM_BLOCKS = 200
Member

Not an issue, but these changes will likely conflict with #43578. If this PR merges first, I'll update the other PR accordingly.

Contributor Author

Cool, thanks for the heads up. If #43578 merges first, I'll update as well.

@@ -259,6 +259,8 @@ def __init__(
self.op_resource_reservation_enabled = DEFAULT_ENABLE_OP_RESOURCE_RESERVATION
# The reservation ratio for ReservationOpResourceAllocator.
self.op_resource_reservation_ratio = DEFAULT_OP_RESOURCE_RESERVATION_RATIO
# Minimum number of output blocks for a dataset.
self.min_num_blocks = DEFAULT_MIN_NUM_BLOCKS
Contributor

This name is a bit confusing here; something like "read_op_min_num_blocks" may be better.

Contributor Author

updated.
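
For reference, setting the renamed field from user code would look like this (the value 400 is just an example; the default is 200 per the constants quoted earlier):

import ray

ctx = ray.data.DataContext.get_current()
# Minimum number of output blocks for a dataset (used by auto-detection).
ctx.read_op_min_num_blocks = 400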

@scottjlee (Contributor) left a comment

LGTM, some small nits

However, you can also override the default value by setting the ``override_num_blocks``
argument. Ray Data decides internally how many read tasks to run concurrently to best
utilize the cluster, ranging from ``1...override_num_blocks`` tasks. In other words,
the higher the ``override_num_blocks``, the smaller the data blocks in the Dataset and
hence the more opportunity for parallel execution.
Contributor

Suggested change
hence the more opportunity for parallel execution.
hence more opportunities for parallel execution.

Contributor Author

updated.
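
For context, a quick illustration of the behavior that paragraph describes (using ray.data.range as a stand-in read API; the block counts are arbitrary examples):

import ray

# Default: Ray Data picks the number of output blocks automatically.
ds = ray.data.range(100_000)

# Override: request more (and therefore smaller) blocks, giving Ray Data
# up to 1000 read tasks that it can schedule in parallel.
ds = ray.data.range(100_000, override_num_blocks=1000)
print(ds.materialize().num_blocks())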

You can override the default parallelism by setting the ``parallelism`` argument. For
more information on how to tune the read parallelism, see
:ref:`Advanced: Performance Tips and Tuning <data_performance_tips>`.
For more information on how to tune the number of output blocks, see
Contributor

Suggested change
For more information on how to tune the number of output blocks, see
For more information on how to tune the number of output blocks, see

Contributor Author

updated.


Tuning read parallelism
Tuning output blocks for read
~~~~~~~~~~~~~~~~~~~~~~~
Contributor

may need to add more ~ to appease lint

Contributor Author

updated.

Comment on lines 28 to 29
The ``override_num_blocks`` parameter passed to Ray Data's :ref:`read APIs <input-output>` specifies the number of output blocks and read tasks to create.
Usually, if the read is followed by a :func:`~ray.data.Dataset.map` or :func:`~ray.data.Dataset.map_batches`, the map is fused with the read; therefore ``override_num_blocks`` also determines the number of map tasks.
Contributor

Suggested change
The ``override_num_blocks`` parameter passed to Ray Data's :ref:`read APIs <input-output>` specifies the number of output blocks and read tasks to create.
Usually, if the read is followed by a :func:`~ray.data.Dataset.map` or :func:`~ray.data.Dataset.map_batches`, the map is fused with the read; therefore ``override_num_blocks`` also determines the number of map tasks.
- The ``override_num_blocks`` parameter passed to Ray Data's :ref:`read APIs <input-output>` specifies the number of output blocks, which is equivalent to the number of read tasks to create.
- Usually, if the read is followed by a :func:`~ray.data.Dataset.map` or :func:`~ray.data.Dataset.map_batches`, the map is fused with the read; therefore ``override_num_blocks`` also determines the number of map tasks.

Contributor Author

updated.
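
For reference, a small sketch of the read/map fusion described in that snippet (the transform is a placeholder):

import ray

def add_double(batch):
    # `id` is the column produced by ray.data.range; batches arrive as dicts of arrays.
    batch["double"] = batch["id"] * 2
    return batch

# The map_batches stage fuses with the read, so both execute as roughly
# 200 tasks, one per requested output block.
ds = ray.data.range(100_000, override_num_blocks=200).map_batches(add_double)
ds.materialize()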

To turn off this behavior and allow the read and map operators to be fused, set ``parallelism`` manually.
For example, this code sets ``parallelism`` to equal the number of files:
To turn off this behavior and allow the read and map operators to be fused, set ``override_num_blocks`` manually.
For example, this code sets ``override_num_blocks`` to equal the number of files:
Contributor

Suggested change
For example, this code sets ``override_num_blocks`` to equal the number of files:
For example, this code sets the number of files equal to ``override_num_blocks``:

Contributor Author

updated.
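
As an illustration of that pattern (the file listing is a placeholder):

import ray

# Placeholder paths; in practice these would be the real input files.
paths = [f"s3://my-bucket/data/part-{i}.parquet" for i in range(16)]

# One output block, and hence one read task, per input file.
ds = ray.data.read_parquet(paths, override_num_blocks=len(paths))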

"``DataContext.min_parallelism`` is deprecated in Ray 2.10. "
"Please specify ``DataContext.read_op_min_num_blocks`` instead."
)

Member

Should we set ctx.read_op_min_num_blocks = ctx.min_parallelism so that DataContext.min_parallelism continues to work?

Contributor Author

Yes, sorry, I missed that part.

Contributor Author

Updated.
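
A minimal sketch of that fallback, assuming both fields are present on DataContext during the deprecation window (illustrative, not the exact diff in this PR):

import ray

DEFAULT_MIN_PARALLELISM = 200  # assumed default for the deprecated field

ctx = ray.data.DataContext.get_current()
# If user code still sets the deprecated field, carry its value over to the
# new one so existing configurations keep taking effect.
if ctx.min_parallelism != DEFAULT_MIN_PARALLELISM:
    ctx.read_op_min_num_blocks = ctx.min_parallelism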

@c21 merged commit c176117 into ray-project:master on Mar 6, 2024
9 checks passed
@c21 deleted the parallelism-doc branch on March 6, 2024
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request on Jun 7, 2024:

[Data] Remove read parallelism from Ray Data documentation (ray-project#43690)

This PR removes read parallelism from the Ray Data documentation and replaces it with the number of output blocks for reads. The motivation is that `parallelism` is already deprecated in favor of `override_num_blocks` for the read APIs.

Signed-off-by: Cheng Su <[email protected]>
Labels
release-blocker P0 Issue that blocks the release

4 participants