
[Core/data] use wait based prefetcher by default #34871

Merged 1 commit into ray-project:master on May 1, 2023

Conversation

@scv119 (Contributor) commented Apr 28, 2023

Why are these changes needed?

Turn on the wait-based prefetcher by default to address the issues with the actor-based prefetcher.

  • [x] Benchmark the before/after performance.
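For context, the general idea behind a "wait-based" prefetcher is to keep a bounded window of block fetches in flight and poll for whichever completes first, rather than routing fetch requests through a dedicated prefetch actor. The sketch below illustrates that pattern only; it is not Ray's implementation. It uses `concurrent.futures` as a stand-in for Ray object refs, and the names `fetch_block`, `prefetch_iter`, and `window` are illustrative, not Ray APIs.

```python
# Illustrative sketch of wait-based prefetching (NOT Ray's actual code):
# keep up to `window` fetches in flight, wait for the first to finish,
# yield it, and immediately schedule the next block.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def fetch_block(i):
    # Placeholder for fetching one data block (e.g. reading from object store).
    return i * i

def prefetch_iter(num_blocks, window=4):
    with ThreadPoolExecutor(max_workers=window) as pool:
        # Prime the prefetch window.
        pending = {pool.submit(fetch_block, i): i
                   for i in range(min(window, num_blocks))}
        next_i = len(pending)
        while pending:
            # Block only until *some* fetch completes, then refill the window.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                del pending[fut]
                if next_i < num_blocks:
                    pending[pool.submit(fetch_block, next_i)] = next_i
                    next_i += 1
                yield fut.result()
```

Because completion order is not guaranteed, a consumer that needs ordered blocks would reorder them; the point of the pattern is that no long-lived helper actor sits between the consumer and the fetches.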

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 marked this pull request as ready for review April 28, 2023 20:12
@scv119 scv119 added the do-not-merge Do not merge this PR! label Apr 28, 2023
@scv119 (Contributor, Author) commented Apr 28, 2023

kicking off benchmark tests

@scv119 (Contributor, Author) commented Apr 30, 2023

Release test results:

  1. streaming_data_ingest_benchmark_1tb
     with this PR: success! total time 64.92455434799194
     baseline: success! total time 64.79763126373291

  2. pipelined_data_ingest_benchmark_1tb.aws
     with this PR:

(ConsumingActor pid=186, ip=10.0.58.86) ##### Overall Pipeline Time Breakdown #####                                            
(ConsumingActor pid=186, ip=10.0.58.86) * Time stalled waiting for next datastream: 7.6ms min, 1.2s max, 161.99ms mean, 4.21s total
(PipelineSplitExecutorCoordinator pid=5130) 2023-04-28 14:53:36,146     INFO streaming_executor.py:147 -- Shutting down <StreamingExecutor(Thread-30, stopped daemon 139750702831360)>.
success! total time 45.71649169921875                                                                                                   
(ConsumingActor pid=186, ip=10.0.49.88) Time to read all data 42.747634274999996 seconds [repeated 19x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(ConsumingActor pid=186, ip=10.0.49.88) P50/P95/Max batch delay (s) 0.009423945000008871 0.13930677815000647 1.6963853930000141 [repeated 19x across cluster]
(ConsumingActor pid=186, ip=10.0.49.88) Num epochs read 1 [repeated 19x across cluster]
(ConsumingActor pid=186, ip=10.0.49.88) Num batches read 1280 [repeated 19x across cluster]
(ConsumingActor pid=186, ip=10.0.58.86)  [repeated 5x across cluster]
(ConsumingActor pid=186, ip=10.0.49.88) Mean throughput 1197.73 MiB/s [repeated 19x across cluster]
(ConsumingActor pid=186, ip=10.0.49.88) Num bytes read 51200.0 MiB [repeated 18x across cluster]

baseline:

##### Overall Pipeline Time Breakdown #####
(ConsumingActor pid=962, ip=10.0.44.103) * Time stalled waiting for next datastream: 7.95ms min, 1.21s max, 122.83ms mean, 3.19s total
(ConsumingActor pid=962, ip=10.0.44.103)                                                                                       
success! total time 46.14367604255676                                                                                                   
(ConsumingActor pid=4712) Time to read all data 43.25214273300003 seconds [repeated 19x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(ConsumingActor pid=4712) P50/P95/Max batch delay (s) 0.009616004500003328 0.14022980685003572 1.838354100999993 [repeated 19x across cluster]
(ConsumingActor pid=4712) Num epochs read 1 [repeated 19x across cluster]
(ConsumingActor pid=4712) Num batches read 1280 [repeated 19x across cluster]
(ConsumingActor pid=4712) Num bytes read 51200.0 MiB [repeated 19x across cluster]
(ConsumingActor pid=4712) Mean throughput 1183.76 MiB/s [repeated 19x across cluster]
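The reported mean throughputs are consistent with the other logged figures: a quick sanity check, assuming throughput is simply total bytes read divided by the time to read all data (an approximation of how the benchmark reports it, not necessarily Ray's exact formula).

```python
# Cross-check the logged mean throughput against bytes read / read time,
# using the numbers from the logs above.
bytes_read_mib = 51200.0

pr_throughput = bytes_read_mib / 42.747634274999996       # with this PR
baseline_throughput = bytes_read_mib / 43.25214273300003  # baseline

print(f"with this PR: {pr_throughput:.2f} MiB/s")        # matches the logged 1197.73
print(f"baseline:     {baseline_throughput:.2f} MiB/s")  # matches the logged 1183.76
```

So the two prefetchers are within about 1% of each other on this benchmark, with the wait-based prefetcher slightly ahead in both throughput and total time.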

@scv119 scv119 removed the do-not-merge Do not merge this PR! label Apr 30, 2023
@scv119 scv119 assigned ericl and c21 Apr 30, 2023
@ericl ericl merged commit b294bfd into ray-project:master May 1, 2023
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
amogkam added a commit that referenced this pull request May 18, 2023
Successfully merging this pull request may close these issues.

[Datasets] Should remove actor-based prefetcher after #30375 is addressed
4 participants