Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Add concurrency and deprecate parallelism for read_parquet APIs #42849

Merged
merged 3 commits into from
Feb 7, 2024

Conversation

c21
Copy link
Contributor

@c21 c21 commented Jan 30, 2024

Why are these changes needed?

This PR is to add a new concurrency parameter for read APIs. The motivation is to allow users to control concurrency for read operator as well, other than map operators.

TODO: Add concurrency parameter for all read APIs besides read_parquet once we agree on the change. Otherwise too many documentation change needs to make.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

python/ray/data/read_api.py Outdated Show resolved Hide resolved
python/ray/data/read_api.py Outdated Show resolved Hide resolved
python/ray/data/read_api.py Show resolved Hide resolved
Comment on lines +185 to +187
ds = ray.data.read_parquet(data_path, filesystem=fs, concurrency=1)
values = [s["one"] for s in ds.take()]
assert sorted(values) == [1, 2, 3, 4, 5, 6]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_parquet could launch more than one concurrent read task and this assertion would still pass. Is there any easy way to test that the concurrency is enforced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, planning to add some test similar to est_backpressure_policies.py:test_basic. But good callout.

@c21 c21 requested a review from omatthew98 as a code owner February 7, 2024 01:11
@c21 c21 changed the title [Data] Add concurrency for read APIs [Data] Add concurrency and deprecate parallelism for read_parquet APIs Feb 7, 2024
python/ray/data/read_api.py Outdated Show resolved Hide resolved
python/ray/data/read_api.py Outdated Show resolved Hide resolved
Signed-off-by: Cheng Su <[email protected]>
@raulchen
Copy link
Contributor

raulchen commented Feb 7, 2024

Just realized that we also need to update unit tests that use parallelism. but can we can do this in the next PR.

@c21
Copy link
Contributor Author

c21 commented Feb 7, 2024

Just realized that we also need to update unit tests that use parallelism. but can we can do this in the next PR.

Yes, those are okay and won't be broken. We can change in another PR.

@c21 c21 merged commit d9b296e into ray-project:master Feb 7, 2024
9 checks passed
@c21 c21 deleted the read-concurrency branch February 7, 2024 20:21
ratnopamc pushed a commit to ratnopamc/ray that referenced this pull request Feb 11, 2024
ray-project#42849)

This PR is to add a new `concurrency` parameter for read APIs. The motivation is to allow users to control concurrency for read operator as well, other than map operators.

TODO: Add `concurrency` parameter for all read APIs besides `read_parquet` once we agree on the change. Otherwise too many documentation change needs to make.

Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Ratnopam Chakrabarti <[email protected]>
tterrysun pushed a commit to tterrysun/ray that referenced this pull request Feb 14, 2024
ray-project#42849)

This PR is to add a new `concurrency` parameter for read APIs. The motivation is to allow users to control concurrency for read operator as well, other than map operators.

TODO: Add `concurrency` parameter for all read APIs besides `read_parquet` once we agree on the change. Otherwise too many documentation change needs to make.

Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: tterrysun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants